Timon Harz
December 5, 2024
Exploring Architectures and Configurations for Large Language Models (LLMs)
Large Language Models (LLMs) like GPT-4 excel in NLP tasks through advanced architectures, including encoder-decoder, causal decoder, and prefix decoder. This article delves into their configurations, activation functions, and training stability for optimal performance.

Exploring Architectures and Configurations for Large Language Models (LLMs)
Introduction
In recent years, language models, particularly large language models (LLMs) like GPT-4, have achieved remarkable success across a range of natural language processing (NLP) tasks. These models excel in activities such as text generation, language translation, and question-answering, showcasing their powerful capabilities. This success is largely due to their ability to learn from vast amounts of text data, combined with advanced architectures and training methods.
LLMs have also unlocked new opportunities for various applications in artificial intelligence (AI) and NLP, improving systems like chatbots, automated content generation, and voice assistants. In this blog, we’ll explore the architecture behind LLMs, focusing on the widely adopted Transformer architecture, pre-training objectives, and model configurations. The Transformer’s parallelizability and scalability have made it the go-to design for LLMs, enabling them to scale to billions or even trillions of parameters. LLMs can generally be categorized into three main types: encoder-decoder, causal decoder, and prefix decoder.
General Architecture
As noted above, existing LLMs can be broadly classified into three types: encoder-decoder, causal decoder, and prefix decoder.

Figure 1: Examining the attention patterns of three prominent architectures reveals distinct differences. The attention between prefix tokens, attention between prefix and target tokens, attention between target tokens, and masked attention are represented by rounded rectangles in blue, green, yellow, and grey colors, respectively.
Encoder-Decoder Architecture
The encoder-decoder architecture, based on the original Transformer model, consists of two primary components: an encoder and a decoder, each built from stacked Transformer blocks. The encoder processes the input sequence with multi-head self-attention layers to generate latent representations, while the decoder applies cross-attention to these representations to produce the target sequence. Pretrained language models (PLMs) like T5 and BART, which utilize this architecture, have proven effective for various NLP tasks. However, only a few large language models (LLMs), such as Flan-T5, are built using the encoder-decoder structure.
Causal Decoder Architecture
The causal decoder architecture employs a unidirectional attention mask, allowing each token to attend only to previous tokens and itself. Both input and output tokens are handled in the same way within the decoder. The GPT-series models (GPT-1, GPT-2, and GPT-3) are prime examples of LLMs built on this architecture. GPT-3, in particular, has demonstrated impressive in-context learning abilities. Many other LLMs, including OPT, BLOOM, and Gopher, have adopted causal decoders as well.
Prefix Decoder Architecture
Also known as the non-causal decoder, the prefix decoder architecture modifies the causal decoder's masking mechanism to allow bidirectional attention over prefix tokens while maintaining unidirectional attention on generated tokens. Similar to the encoder-decoder structure, prefix decoders encode the prefix sequence bidirectionally and predict output tokens autoregressively with shared parameters. A common approach is to train a causal decoder first and then convert it into a prefix decoder to speed up convergence. LLMs like GLM-130B and U-PaLM are built with prefix decoders.
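The difference between these decoder variants comes down to the attention mask. Below is a minimal PyTorch sketch that builds the two boolean masks; the sequence length and the prefix_len parameter are illustrative values, not taken from any particular model.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Each token may attend only to itself and earlier positions.
    # True = attention allowed, False = masked out.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

def prefix_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Prefix tokens attend to each other bidirectionally;
    # generated tokens still attend causally to everything before them.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

if __name__ == "__main__":
    print(causal_mask(5).int())
    print(prefix_mask(5, prefix_len=3).int())
```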
All three architectures can be enhanced using the mixture-of-experts (MoE) technique, where a subset of neural network weights is activated for each input. This method, used in models like Switch Transformer and GLaM, has led to notable performance improvements by scaling the number of experts or total parameters.
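The routing idea behind MoE can be sketched in a few lines of PyTorch. The block below is a simplified, dense top-k router for illustration only; production MoE layers (such as Switch Transformer's top-1 routing) add load-balancing losses and sparse expert dispatch, which are omitted here, and all sizes are assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts feed-forward layer (illustrative sketch)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); the router picks k experts per token.
        scores = self.router(x)                          # (tokens, num_experts)
        weights, indices = scores.topk(self.k, dim=-1)   # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Dense loop over experts for clarity; real systems dispatch sparsely.
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                hit = indices[:, slot] == e
                if hit.any():
                    out[hit] += weights[hit, slot].unsqueeze(-1) * expert(x[hit])
        return out
```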
Detailed Configurations
Since the introduction of the Transformer model, researchers have made numerous advancements to improve training stability, performance, and computational efficiency. In this section, we will explore configurations related to four key Transformer components: normalization, position embeddings, activation functions, and attention and bias mechanisms.
Normalization
To address training instability in pre-training large language models (LLMs), layer normalization (LN) has become a standard technique in Transformer architectures. The placement of LN plays a key role in model performance. While the original Transformer used post-LN, most modern LLMs opt for pre-LN, as it tends to provide more stable training, despite a slight reduction in performance.
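To make the placement difference concrete, here is a minimal pre-LN block sketch in PyTorch; post-LN would instead apply the normalization after each residual addition, i.e. x = LN(x + Sublayer(x)). The module sizes and the use of nn.MultiheadAttention are illustrative assumptions rather than any specific model's implementation.

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN Transformer block: x = x + Sublayer(LayerNorm(x))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ffn(self.ln2(x))                       # residual around feed-forward
        return x
```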
To further improve stability, researchers introduced an additional layer of LN, known as Sandwich-LN, positioned before residual connections to prevent value explosion. However, in practice, Sandwich-LN has not always been effective at stabilizing LLM training and may even lead to training collapse in some cases.
Alternative normalization methods have been explored in models like Gopher and Chinchilla, which use RMS Norm for faster training and better performance. GLM-130B, on the other hand, incorporates DeepNorm, a post-normalization technique that offers enhanced training stability. Adding an extra LN layer after the embedding layer has been found to stabilize training, but it often results in a notable performance drop, making it less common in recent LLMs.
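For reference, RMSNorm drops the mean-centering and bias of standard LayerNorm and rescales activations by their root mean square alone, which is what makes it cheaper. A minimal sketch following the usual formulation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMS normalization: rescale by the root mean square of the features,
    with a learned gain but no mean subtraction or bias term."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)
```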
Activation Functions
The choice of activation function is critical for optimal performance in feed-forward networks. GeLU activations are widely used in existing LLMs. More recently, models like PaLM and LaMDA have adopted variations of GLU activation, particularly SwiGLU and GeGLU, which have shown improved performance in practical applications.

Figure 2: Activation Function Comparison
However, these variants introduce additional parameters—about 50% more than GeLU activations—in the feed-forward networks.
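A GLU-variant feed-forward layer can be sketched as follows; using F.silu for the gate gives SwiGLU, while swapping in F.gelu gives GeGLU. The bias-free projections and the dimensions are illustrative choices, and the third projection matrix is where the extra parameters come from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """GLU-style feed-forward: a gated branch multiplied element-wise with a
    linear branch, then projected back down. Three weight matrices instead of
    two, hence roughly 50% more FFN parameters than a plain GeLU FFN."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Replace F.silu with F.gelu to obtain GeGLU.
        return self.down(F.silu(self.gate(x)) * self.up(x))
```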
Positional Embeddings
Positional embeddings are essential in Transformers to inject position information into sequence modeling, since the self-attention mechanism is permutation equivariant. The original Transformer offers two types of absolute position embeddings, sinusoidal and learned; of the two, learned position embeddings are the more common choice in LLMs.
Relative positional encodings, on the other hand, generate embeddings based on the offsets between keys and queries. This allows the model to handle sequences longer than any seen during training, enabling effective length extrapolation.
ALiBi (Attention with Linear Biases) introduces a penalty that adjusts attention scores based on the distance between keys and queries, enhancing zero-shot generalization and extrapolation capabilities compared to other position embeddings.
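As an illustration, the ALiBi penalty can be computed as an additive bias over query-key distances before the softmax. The sketch below assumes a causal setting and uses the power-of-two slope recipe; it is a simplification of the published method rather than a faithful reproduction.

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """Additive attention bias of shape (num_heads, seq_len, seq_len)."""
    # Head-specific slopes form a geometric sequence; this recipe matches the
    # power-of-two-heads case and is an approximation otherwise.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = i - j, i.e. how far key j lies behind query i.
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)
    # Penalty grows linearly with distance; future positions are masked separately.
    return -slopes[:, None, None] * distance
```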
RoPE (Rotary Positional Embeddings) employs rotating matrices based on absolute positions to compute attention scores between keys and queries, incorporating relative position information for better modeling of long sequences. Given its advantages, RoPE has been widely adopted in recent LLMs.
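A minimal single-head RoPE sketch (the "rotate-half" formulation used in several open implementations) is shown below; the input shape and base frequency are assumptions. In practice the same rotation is applied to both queries and keys before their dot product, and the cosine/sine tables are precomputed and cached.

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-dimension rotation frequencies: base^(-2i/dim) for i = 0 .. dim/2 - 1.
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```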

Figure 3: Positional Embedding
Attention and Bias
In addition to the standard self-attention mechanism in the original Transformer, GPT-3 incorporates sparse attention, specifically Factorized Attention, to reduce computational complexity. Researchers have also explored different strategies to efficiently handle longer sequences, such as implementing specialized attention patterns or optimizing GPU memory access, as seen in models like FlashAttention.
While biases are generally included in each dense kernel and Layer Norm in the original Transformer, recent LLMs, including PaLM and Galactica, have opted to remove them. This removal has been shown to improve training stability in LLMs, as demonstrated in various studies.
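A bias-free causal attention block along these lines can be sketched with PyTorch 2.x's fused scaled_dot_product_attention, which dispatches to a FlashAttention-style kernel on supported hardware. The module layout below is an illustrative assumption, not a reproduction of PaLM or Galactica.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasFreeCausalAttention(nn.Module):
    """Causal self-attention with bias-free projections, in the spirit of
    PaLM-style bias removal. Shapes and names are illustrative."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # no bias terms
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim) for the fused kernel.
        q, k, v = (z.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
                   for z in (q, k, v))
        # PyTorch 2.x picks a fused (FlashAttention-style) backend when possible.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, d))
```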

Figure 4: Detailed formulations for the network configurations.
Summary of Configuration Recommendations
Existing literature suggests the following for configuring large language models (LLMs): to enhance generalization and training stability, use pre-layer normalization with RMSNorm, and choose SwiGLU or GeGLU as the activation function. Avoid placing an extra LN layer immediately after the embedding layer, as it tends to degrade performance. For position embeddings, RoPE or ALiBi are preferred, as they handle long sequences well.
Conclusion
Large language models (LLMs) have revolutionized the way we approach natural language tasks, thanks to advancements in their design and training methods. Key improvements in model architecture, attention mechanisms, and training efficiency are paving the way for the future of AI.
What do you think will be the next big breakthrough in LLMs? As these models continue to evolve, challenges like handling long sequences and improving efficiency will remain crucial.
For more detailed articles and the latest insights on large language models, explore our LLM Blog Series.
Frequently Asked Questions (FAQ)
1. What are large language models (LLMs)?
Large language models (LLMs) are powerful artificial intelligence models that excel in natural language processing tasks. They are trained on vast amounts of text data and have shown remarkable capabilities in tasks like text generation, language translation, and question-answering.
2. How do LLMs contribute to advancements in AI and NLP?
LLMs have opened up new possibilities in AI and NLP applications. They have improved chatbots, automated content generation, and voice assistants. LLMs have also enhanced various NLP tasks, enabling more accurate language understanding and generation.
3. What are the different types of LLM architectures?
LLM architectures can be broadly classified into three types: encoder-decoder, causal decoder, and prefix decoder. Each type has its own advantages and has been used in different LLM models for specific purposes.
4. Which activation functions are commonly used in LLMs?
GeLU (Gaussian Error Linear Unit) activations are commonly used in existing LLMs. However, recent models like PaLM and LaMDA have employed variants of GLU (Gated Linear Unit) activation, such as SwiGLU and GeGLU, which have shown better performance in practical applications.