Timon Harz

December 12, 2024

How Large Language Models work

Discover how LLMs like ChatGPT are revolutionizing human-computer interaction with their ability to understand and generate language. This article explains their inner workings in an easy-to-follow format.

Large Language Models (LLMs) have brought Artificial Intelligence into the spotlight, captivating nearly everyone with their remarkable ability to understand and generate human language. Among the most well-known LLMs is ChatGPT, which has gained widespread popularity thanks to its ability to interact with users in a way that feels incredibly natural. The magic of this breakthrough lies in the fact that natural language, something we all use every day, is now the interface that connects us with some of the most sophisticated AI technologies ever created.

But despite all the buzz, many people still don’t fully understand how LLMs work. For most, this topic is shrouded in mystery, and unless you're a data scientist or have a background in AI, it can feel like a complex world that's hard to crack. In this article, my goal is to change that and give you a clear, approachable guide to how LLMs work—without overwhelming you with jargon or leaving you more confused than before.

I won’t pretend this is a simple task. The LLMs we have today are the result of decades of intense research in the field of AI, and they are incredibly sophisticated. Unfortunately, much of the existing literature on LLMs falls into one of two extremes: either it's too technical, assuming you already know a lot about the topic, or it oversimplifies things to the point where you don’t actually learn anything new.

So, here’s my promise: This article will find the sweet spot between those extremes. It’s not just a superficial overview, and it’s not an academic deep dive either. Instead, we’ll take a journey together, starting from square one, and work our way up to understanding exactly how LLMs are trained and why they work so well. Along the way, I’ll break things down into bite-sized concepts, relying on intuition and visuals rather than complex math. The underlying principles are surprisingly intuitive, and once you grasp them, you’ll have a much clearer picture of how these AI models function.

This article also serves as a practical guide for getting the most out of using LLMs like ChatGPT. We’ll cover some useful tips and tricks that can help you frame your queries in a way that gets you better, more useful responses. As AI researcher Andrej Karpathy famously put it: “The hottest new programming language is English.” By the end of this, you'll be much more adept at interacting with LLMs, understanding their strengths, and knowing how to maximize their potential.

But before we dive into all that, let’s first take a step back and explore where LLMs fit into the larger landscape of Artificial Intelligence. Understanding their role in the broader AI ecosystem will give us the context we need to appreciate their true power.

Artificial Intelligence (AI) is a broad field, often imagined as a series of layers, each more specialized than the last. At its core, AI involves creating machines that can perform tasks that typically require human intelligence.

Machine Learning (ML) is a subset of AI that focuses on enabling machines to recognize patterns in data. Once a machine identifies a pattern, it can apply that knowledge to make predictions or decisions about new, unseen data. This is the foundational concept of ML, and we’ll dive into it in just a moment.

Within ML, there’s a more specialized area called Deep Learning, which deals with unstructured data like text, images, and audio. Deep Learning relies on artificial neural networks, which are inspired by the structure and function of the human brain, allowing machines to process complex data in ways that mimic human cognition.

Large Language Models (LLMs), the focus of this article, are a type of deep learning model specifically designed to work with text. These models have revolutionized the way we interact with machines, as they can generate and understand human language with impressive accuracy.

Throughout this article, we’ll explore how each of these layers contributes to the development of LLMs, but we’ll skip over the broader AI layer since it’s too general for our purposes. Instead, we’ll jump straight into Machine Learning, where the real magic happens.

The goal of Machine Learning is to uncover patterns within data, specifically patterns that reveal the relationship between an input and an outcome. To make this clearer, let's look at an example.

Imagine we want to distinguish between two of my favorite music genres: reggaeton and R&B. If you're not familiar with these genres, here’s a quick overview to help frame the task. Reggaeton is a Latin urban genre known for its energetic beats and infectious rhythms that get people moving, while R&B (Rhythm and Blues) is a genre with deep roots in African-American musical traditions, characterized by soulful vocals and a blend of both upbeat and slower, emotional songs.

Let’s say we have 20 songs, each with two measurable metrics: tempo and energy. Additionally, each song is labeled with a genre, either reggaeton or R&B. When we plot these songs on a graph with tempo on one axis and energy on the other, we notice a clear pattern: high-energy, fast-tempo songs tend to be reggaeton, while slower, lower-energy songs are mostly R&B. This makes sense based on what we know about the genres.

However, manually labeling every song with a genre is time-consuming and not scalable. Instead, we want to learn the relationship between the song metrics (tempo, energy) and their genre, so we can predict the genre of new songs using just these easily accessible metrics.

In Machine Learning terms, this is a classification problem, where the goal is to predict a category (in this case, reggaeton or R&B). This contrasts with a regression problem, where the outcome is a continuous value, like temperature or distance.

To solve this, we “train” a Machine Learning model, or classifier, using our labeled dataset — the songs with known genres. During training, the model tries to find the line that best separates the two genres based on tempo and energy.

So how does this help? Now that we’ve trained the model and found this line, we can use it to predict the genre of any new song. By comparing its tempo and energy to the line, the model can determine whether the song is likely to be reggaeton or R&B. We no longer need a human to label each song, making the process much faster and more scalable.

What’s more, the further a song sits from the line, the more confident the model is in its prediction. For example, if a new song has low energy and a slow tempo and falls far onto the R&B side of the line, the model might be 98% certain it’s an R&B song, with just a 2% chance it’s reggaeton. This confidence gives us additional insight into how reliable each prediction is.
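To make this concrete, here is a minimal sketch of such a classifier in Python. The library (scikit-learn) and the tempo and energy numbers are my own illustrative choices, not something prescribed by the example above.

```python
# A minimal, illustrative genre classifier; the numbers are made up.
from sklearn.linear_model import LogisticRegression

# Each song is described by [tempo in BPM, energy from 0 to 1], plus its known genre.
songs = [[96, 0.85], [92, 0.90], [100, 0.80], [70, 0.40], [65, 0.35], [72, 0.30]]
genres = ["reggaeton", "reggaeton", "reggaeton", "rnb", "rnb", "rnb"]

# "Training" finds the line that best separates the two genres in tempo/energy space.
model = LogisticRegression()
model.fit(songs, genres)

# Predict the genre of a new, unlabeled song and inspect the model's confidence.
new_song = [[68, 0.38]]
print(model.predict(new_song))        # e.g. ['rnb']
print(model.predict_proba(new_song))  # e.g. [[0.02, 0.98]] -> quite sure it's R&B
```

A linear model like this is exactly the "line" described above; much of what follows in this article is, in a sense, about replacing that line with something far more flexible.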

Of course, things are rarely as simple as they seem. In reality, the boundary that separates the classes may not be a straight line. The relationship between the inputs and the outcome could be far more complex. It might be curved, like in the example above, or even more intricate than that.

Moreover, real-world problems tend to be more complicated in another way. Instead of just two input variables like in our example, we often deal with tens, hundreds, or even thousands of input features. On top of that, we usually have more than two classes to predict. These classes can each be influenced by all the inputs in complex, non-linear ways.

Even in our relatively simple example, we know there are more than just reggaeton and R&B genres out there, and we would need more than just tempo and energy to classify songs accurately. The relationship between these additional factors is likely to be much more intricate.

The key takeaway here is that the more complex the relationship between the inputs and the output, the more sophisticated the Machine Learning model needs to be in order to capture it. Typically, this complexity grows as we increase the number of inputs and the number of classes.

Additionally, more data is crucial. The more complex the model and the relationships it needs to learn, the more data we need to train it effectively. You’ll see why this is so important shortly.

Let’s shift gears and tackle a slightly different problem, but one where we can apply the same mental model we discussed earlier. This time, our input is an image—say, a picture of a cute cat in a bag (because, let’s be honest, cat examples always make things more fun).

For the output, we have three possible labels: tiger, cat, and fox. To add some motivation, imagine we want to protect a herd of sheep, so we need to trigger an alarm if we spot a tiger, but not if it’s just a cat or a fox.

As with our previous example, this is still a classification task, since the output can only belong to one of a few fixed categories. Therefore, we could use labeled data (images with assigned labels) and train a Machine Learning model to make predictions.

But here’s the tricky part: how do we process an image as input? Unlike the numeric metrics we dealt with earlier, images are visual, and computers process data in numbers. Fortunately, images are just numbers too—they’re made up of pixels, each with a value that represents its color. An image has a height, width, and typically three color channels (red, green, and blue). So, in theory, we could feed these pixel values directly into a Machine Learning model.

However, two big challenges arise. First, even a relatively small image, say 224x224 pixels, contains over 150,000 values (224 x 224 x 3 for the three color channels). Remember, we were working with just a handful of variables before, but now we’re dealing with at least 150,000.

Second, the relationship between the raw pixel data and the class label is incredibly complex. While humans can easily distinguish between a tiger, a cat, and a fox just by looking at an image, if you were to examine the 150,000 individual pixels one by one, you wouldn’t have a clue what the image represents. This is how a Machine Learning model sees the image—it has to learn from scratch how to map those raw pixels to the correct label, which is no easy feat.
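If you want to see those raw numbers for yourself, here’s a tiny sketch. It assumes a local image file called cat.jpg and the Pillow and NumPy libraries, none of which are required by anything above.

```python
# Turn an image into the raw numbers a model would actually receive.
import numpy as np
from PIL import Image

image = Image.open("cat.jpg").convert("RGB").resize((224, 224))
pixels = np.array(image)      # shape: (224, 224, 3) -> height, width, RGB channels

print(pixels.shape)           # (224, 224, 3)
print(pixels.size)            # 150528 values describing this single image
flat = pixels.flatten()       # the long list of numbers a simple model would see
print(flat[:10])              # the first few pixel values
```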

Let’s explore another complex input-output relationship: the connection between a sentence and its sentiment. Sentiment typically refers to the emotion conveyed by a sentence, often categorized as either positive or negative.

To set up the problem: the input here is a sequence of words (a sentence), and the outcome variable is the sentiment, which we’ll classify as either positive or negative. Like the image classification example, this is a classification task with two possible labels.

As humans, we can easily understand this relationship—we instinctively know when a sentence feels positive or negative. But the question is: can we teach a Machine Learning model to understand it in the same way?

Before we dive into that, there’s a key hurdle to address: how do we convert words into numeric inputs for a Machine Learning model? This is a more complicated problem than working with images, which are essentially already numeric. Words, however, aren’t inherently numerical. We won’t go into all the technical details here, but the important concept to understand is that we can transform each word into something called a "word embedding."

A word embedding is a numerical representation of a word’s meaning—its semantics and syntax—within a specific context. These embeddings are typically generated during training or through a separate pre-training process. Depending on the model, word embeddings can consist of anywhere from tens to thousands of variables per word.

To summarize, we can take a sentence and convert it into a sequence of numeric inputs (word embeddings), which capture each word's meaning. These inputs can then be fed into a Machine Learning model for classification.
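To make this less abstract, here’s a small sketch using pre-trained GloVe embeddings loaded through the gensim library. Both are my choices for illustration; the argument above doesn’t depend on any particular embedding model.

```python
# Look up pre-trained 50-dimensional word embeddings (downloads the vectors on first run).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

print(vectors["cat"][:5])                   # first 5 of the 50 numbers representing "cat"
print(vectors.most_similar("cat", topn=3))  # semantically close words, e.g. dog, rabbit

# A sentence becomes a sequence of such vectors: the numeric input for a model.
sentence = "that was a great fall"
embedded = [vectors[word] for word in sentence.split()]
print(len(embedded), "words,", len(embedded[0]), "numbers each")
```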

Now, we face a challenge similar to the one we encountered with images: the complexity of handling long sentences. A single long sentence (or a paragraph, or even an entire document) can result in a large number of inputs due to the size of the word embeddings, which makes the task more computationally demanding.

The second challenge lies in the complexity of the relationship between language and sentiment. Take a sentence like "That was a great fall" — the interpretation could vary widely, especially when sarcasm is involved.

To tackle this, we need a highly advanced Machine Learning model, along with vast amounts of data. This is where Deep Learning becomes essential.

We’ve already made significant progress in understanding LLMs by covering the basics of Machine Learning and the reasons why more powerful models are necessary. Now, let’s take another major step by introducing Deep Learning.

As we discussed earlier, when the relationship between inputs and outputs is highly complex, or when there are a large number of variables (which is true for both our image and sentiment examples), traditional models like linear ones just won’t cut it. For these tasks, we need models that are far more flexible and powerful.

This is where neural networks come into play.

Neural networks are powerful Machine Learning models that can capture highly complex relationships. They are the driving force behind learning these intricate patterns at massive scales.

Neural networks are loosely inspired by the human brain, though how far that similarity really goes is debated. At their core, they have a simple structure: a series of layers of connected "neurons" that process an input signal to predict an output. Think of them as multiple layers of linear regression stacked on top of each other, with non-linearities added in between so the network can model intricate, non-linear relationships.
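As a rough sketch, here’s what "stacked linear layers with non-linearities in between" looks like in code, using PyTorch and our two song features from earlier. The layer sizes are arbitrary choices for illustration.

```python
# A tiny neural network: linear layers with non-linearities (ReLU) in between.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 16),   # input: our 2 song features (tempo, energy)
    nn.ReLU(),          # non-linearity, so the decision boundary can bend
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Linear(16, 2),   # output: one score per class (reggaeton, R&B)
)

scores = model(torch.tensor([[96.0, 0.85]]))  # one (untrained) forward pass
probs = torch.softmax(scores, dim=-1)         # turn scores into class probabilities
print(probs)
```

Training would then adjust the weights inside those layers, just as training found the separating line earlier.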

Neural networks are typically deep, meaning they have many layers, which is where the term “Deep Learning” comes from. This depth is what allows them to handle enormous complexity. GPT-3, for example, the model family behind the original ChatGPT, has around 175 billion parameters, a number often compared (somewhat loosely, since parameters are learned weights rather than neurons) to the roughly 86 billion neurons in the human brain.

From here on, we’ll assume a neural network as our Machine Learning model, integrating our understanding of processing images and text.

Now, let's dive into Large Language Models (LLMs), and this is where things get really exciting. By this point, you should have all the foundational knowledge to understand what LLMs are.

To start, let's break down what "Large" means in this context. It refers to the number of parameters in the neural network, the adjustable weights the model learns during training. While there’s no exact threshold for what defines a Large Language Model, anything with more than roughly 1 billion parameters is generally considered large.

With that out of the way, let’s move on to what a "language model" is. And just so you know, we’ll soon get into what the "GPT" in ChatGPT stands for. But let’s take it step by step.

Let’s frame the idea as a Machine Learning problem: given a sequence of words, whether it’s a sentence or a whole paragraph, how do we predict the word that comes next? Based on what we’ve covered earlier in this article, this is easy to set up as a Machine Learning task, quite similar to the sentiment classification we discussed before.

In this case, the input to the neural network is a sequence of words, and the output is the next word in that sequence. This is still a classification problem, but instead of just a few classes, we now have as many classes as there are words (more precisely, tokens) in the model's vocabulary, often around 50,000. This is the essence of language modeling: predicting the next word.
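Here’s a conceptual sketch of what that output looks like. The hidden size and the stand-in "sequence so far" are placeholders of my own; only the shape of the output matters for the point being made.

```python
# Language modeling as classification: one score per word in the vocabulary.
import torch
import torch.nn as nn

vocab_size = 50_000    # roughly how many words/tokens the model can choose from
hidden_size = 768      # size of the model's internal representation (illustrative)

output_layer = nn.Linear(hidden_size, vocab_size)  # final layer of a language model

hidden_state = torch.randn(1, hidden_size)   # stand-in for "the sequence so far"
scores = output_layer(hidden_state)          # one score for each possible next word
probs = torch.softmax(scores, dim=-1)        # probabilities over the whole vocabulary
print(probs.shape)                           # torch.Size([1, 50000])
print(probs.sum())                           # ~1.0: a proper probability distribution
```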

Clearly, this task is much more complex than binary sentiment classification, but now that we understand the power of neural networks, the real question is: why shouldn’t a large enough network be able to learn it?

A quick disclaimer: We’re simplifying a lot of concepts here (as we have throughout the article). In reality, things are more complex, but focusing on the core mechanics helps us grasp the main ideas without getting bogged down in the finer details.

Now that we understand the task, we need data to train the neural network. Fortunately, creating a vast amount of data for our "next word prediction" task is not difficult. The internet, books, research papers, and more provide an abundance of text that we can use to build a massive dataset. The beauty of this is that we don’t need to manually label the data — the next word in each sequence acts as the label, which is why this approach is called self-supervised learning.

As shown in the image above, a single sequence can be transformed into multiple training sequences. And we have plenty of them, ranging from short phrases to long passages (some stretching over thousands of words), so the model learns to predict the next word in various contexts.
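Here’s a bare-bones sketch of that idea: one sentence turned into several training examples, where the next word itself acts as the label.

```python
# Self-supervised data creation: the next word is the label, no human labeling needed.
text = "the cat likes to sleep in the box"
words = text.split()

examples = []
for i in range(1, len(words)):
    context = words[:i]      # everything seen so far is the input
    next_word = words[i]     # the word that actually comes next is the label
    examples.append((context, next_word))

for context, next_word in examples:
    print(" ".join(context), "->", next_word)
# the -> cat, the cat -> likes, the cat likes -> to, ...
```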

In summary, we’re training a neural network (the LLM) to predict the next word in any sequence of words, whether it’s in English, German, or any other language, or even if it’s a tweet, mathematical formula, poem, or code snippet. All of these types of sequences are included in the training data.

With a sufficiently large neural network and enough data, the LLM becomes excellent at predicting the next word. Will it always be perfect? No — there are often multiple words that could follow a sequence. But it will get good at choosing the most syntactically and semantically appropriate one.

Now that we can predict one word, we can feed the extended sequence back into the LLM to predict the next word, and continue this process. In other words, with our trained LLM, we can generate entire text passages, not just a single word. This is why LLMs are considered a form of Generative AI — we’ve essentially taught the LLM to "speak" one word at a time.

There’s another important detail to understand here. We don’t always have to predict the most likely word. Instead, we can sample from, say, the five most likely words at any given point. This introduces an element of creativity to the LLM’s output. Some LLMs even allow you to control how deterministic or creative you want the results to be. This is why, in ChatGPT, which uses such a sampling strategy, you typically get different responses each time you regenerate an answer.
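You can see this behavior with openly available models. The sketch below uses GPT-2 through the Hugging Face transformers library as a stand-in (ChatGPT itself isn’t available this way), and the sampling settings are illustrative.

```python
# Generating text by repeatedly predicting the next word, with a bit of randomness.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# do_sample + top_k: instead of always taking the single most likely next word,
# sample from the 5 most likely ones at each step -- hence different outputs per run.
for _ in range(3):
    result = generator("The weather today is", max_new_tokens=15,
                       do_sample=True, top_k=5)
    print(result[0]["generated_text"])
```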

Now, speaking of ChatGPT, you might wonder why it’s not called ChatLLM. Well, language modeling is only the starting point — it’s far from the whole story. So, what does the “GPT” in ChatGPT stand for?

We've already learned what the "G" stands for — "generative," meaning the model was trained on a language generation task, which we've discussed. But what about the "P" and the "T"?

Let's quickly touch on the "T," which stands for "transformer." No, not the movie character, but a type of neural network architecture. This may not be crucial for now, but if you're curious, the main advantage of the transformer is that it can focus its attention on the most relevant parts of the input sequence at any given moment. It's somewhat similar to how humans focus on what's most important for a task and ignore the rest.

Now, let's dive into the "P," which stands for "pre-training." This is where things get interesting. We’ll explain why, with Large Language Models like ChatGPT, we talk about pre-training instead of just regular training. The reason is that these models are trained in phases.

Pre-training

The first phase is pre-training, which is essentially what we’ve covered so far. During this stage, the model is trained on massive amounts of data to predict the next word. Through this, it not only masters the grammar and syntax of language but also learns a great deal about the world and even develops some emergent abilities that we'll explore later.

Now, I have a few questions for you: What might be the issue with this type of pre-training? There are certainly several potential problems, but one key issue is what the LLM has actually learned.

What it’s really learned is how to ramble about a topic. It may do an excellent job at that, but it hasn't learned how to respond to the types of inputs you’d typically want to give to an AI, like questions or instructions. The result is that the model doesn’t behave like an assistant.

For example, if you ask a pre-trained LLM, "What is your first name?" it might reply with "What is your last name?" simply because it's been trained on data like forms, where such sequences often appear. It's only trying to complete the input sequence, not engage in a meaningful conversation.

It also struggles with following instructions because that type of language structure—an instruction followed by a response—isn't commonly seen in the training data. Perhaps platforms like Quora or Stack Overflow come closest to this kind of structure.

At this point, we can say that the LLM isn't aligned with human intentions. Alignment is a crucial topic for LLMs, and the good news is that we can fix this issue to a large extent. It turns out that even though pre-trained LLMs don't initially respond well to instructions, they can be trained to do so.

Instruction Fine-tuning and RLHF

This is where instruction fine-tuning comes in. We take the pre-trained LLM, which has already learned some language abilities, and continue the training process—similar to before, where the model learns to predict one word at a time. However, this time, we use high-quality instruction-response pairs as our training data.

This approach helps the model unlearn its tendency to merely complete text and instead teaches it to become a helpful assistant that follows instructions and responds according to the user’s intent. The instruction dataset is usually much smaller than the one used in pre-training, as high-quality instruction-response pairs are more expensive to create, typically requiring human involvement. This contrasts with the self-supervised labels used in pre-training, which are cheaper to gather. This phase is often referred to as supervised instruction fine-tuning.
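To give a flavor of what such data can look like, here’s a tiny, made-up sketch. The "### Instruction / ### Response" template is purely illustrative, not the format used by ChatGPT or any particular vendor.

```python
# Illustrative instruction-response pairs, formatted into plain text so the model
# can keep learning with the same next-word-prediction objective as before.
pairs = [
    {"instruction": "Summarize: The cat sat on the mat all day.",
     "response": "A cat spent the whole day sitting on a mat."},
    {"instruction": "What is the capital of France?",
     "response": "The capital of France is Paris."},
]

def to_training_text(pair):
    return f"### Instruction:\n{pair['instruction']}\n\n### Response:\n{pair['response']}"

for pair in pairs:
    print(to_training_text(pair))
    print("---")
```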

Additionally, some LLMs, like ChatGPT, undergo a third stage known as reinforcement learning from human feedback (RLHF). While we won’t dive into the details here, its goal is similar to that of instruction fine-tuning: RLHF further improves the model’s alignment, ensuring its output better reflects human values and preferences. Research so far suggests this stage matters a great deal for making responses genuinely helpful and well-aligned, and combining reinforcement learning with language modeling remains a promising direction that is likely to keep driving improvements in LLM capabilities.

Now that we've covered the process, let's explore how this understanding applies to some common use cases.

Why Can an LLM Perform Summarization?

One of the impressive capabilities of an LLM is its ability to summarize longer pieces of text. If you’ve ever pasted a document and asked it to summarize, you’ll know it does an excellent job. So why is it so effective?

The key lies in the training data. Summarization happens frequently across many types of content—whether it's articles, research papers, or books. As a result, an LLM trained on this data learns how to perform this task as well. It learns to identify the main points and condense them into a concise summary.

When you ask for a summary, the full text is part of the input sequence, which mirrors patterns the model has seen during training, such as a research paper whose conclusion condenses the content that precedes it. So summarization is most likely a skill acquired during pre-training, and instruction fine-tuning, whose datasets presumably also contain summarization examples, sharpened it further.

Why Can an LLM Answer Common Knowledge Questions?

An LLM’s ability to answer common knowledge questions stems from instruction fine-tuning and RLHF, which teach the model to respond appropriately to instructions. However, the majority of the knowledge needed to answer these questions is acquired during pre-training.

But here’s a crucial follow-up question: What if the LLM doesn’t know the answer? Unfortunately, in such cases, the model might just generate a response that sounds plausible, even if it’s incorrect. To understand why this happens, we need to consider both the training data and the objective behind the training process.

You may have come across the term "hallucination" when discussing LLMs, which refers to instances where the model fabricates facts or information that aren't true.

So, why does this happen? The key issue is that LLMs are trained to generate text, not necessarily factually accurate text. There is no direct mechanism in the training process that teaches the model to differentiate between true and false information. However, that’s not the primary concern. In reality, much of the text on the internet and in books is written confidently, so the LLM learns to generate responses with similar confidence—even when it’s wrong. This results in a model that often lacks an understanding of uncertainty.

That said, this is an active research area, and we can expect that LLMs will become less prone to hallucinations over time. For example, instruction tuning can help train the model to avoid generating incorrect information, though it remains to be seen if we can fully eliminate this issue.

In fact, we can begin to address this problem right now. Armed with the knowledge we’ve gathered, we already have strategies in place that help mitigate hallucinations and are widely used today.

Imagine asking an LLM, "Who is the current president of Colombia?" There's a good chance it could provide the wrong name. This could happen for two reasons:

First, as we've discussed, the LLM might simply hallucinate and generate an incorrect or entirely fabricated name.

Second, LLMs are trained only on data collected up to a certain cut-off date, which may be a year or more in the past. As a result, the LLM can't be sure who the current president is, because the presidency may have changed since its training data was collected.

So, how can we address both of these issues? The solution lies in providing the model with relevant context. The idea is that anything included in the LLM's input sequence is directly available for processing, while the model's implicit knowledge from pre-training is harder to retrieve and can be less reliable.

For example, if we include the Wikipedia article on Colombia’s political history in the LLM's context, it’s much more likely to give the correct answer, as it can extract the president's name directly from the updated information.

This method is known as "grounding" the LLM in context, or grounding it in the real world, as opposed to letting it generate answers independently.

This is how search-based LLMs like Bing Chat operate. They first retrieve relevant context from the web through a search engine, then pass that information to the LLM along with the user's query. The process is visualized in the image above.
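Stripped of all engineering detail, the grounding idea can be sketched like this. The retrieve_from_search function is a hypothetical placeholder for a real search or retrieval step, and the snippet it returns is just an example.

```python
# "Grounding": fetch relevant text first, then put it into the LLM's input.
def retrieve_from_search(query: str) -> str:
    # Hypothetical placeholder: a real system would query a search engine or index.
    return ("Wikipedia excerpt: Gustavo Petro took office as President of Colombia "
            "in August 2022.")

def build_grounded_prompt(question: str) -> str:
    context = retrieve_from_search(question)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(build_grounded_prompt("Who is the current president of Colombia?"))
```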

By now, you likely have a solid understanding of how state-of-the-art LLMs work (at least as of the second half of 2023). You might be thinking, "This isn't that magical—it's just predicting one word at a time. It's all based on statistics, after all." But is it really that simple?

Let’s take a step back. The truly remarkable part of all this is how incredibly well it works. In fact, even the researchers at OpenAI were amazed by how far language modeling has come. One of the key factors in the recent advancements has been the massive scaling up of neural networks and datasets, which has led to significant improvements in performance. For instance, GPT-4, whose parameter count OpenAI has not disclosed but is widely rumored to exceed one trillion, can pass a simulated bar exam with a score in the top 10% of test takers and earn a top grade on the AP Biology exam.

Even more surprising is that large LLMs exhibit emergent abilities, meaning they can tackle tasks and perform actions they weren't explicitly trained to handle.

In the final section of this article, we'll explore some of these emergent abilities, and I'll show you how to leverage them to solve problems.

A fascinating emergent ability of LLMs is what’s known as zero-shot learning. This means they can perform entirely new tasks they haven’t been explicitly trained on, simply by receiving instructions on how to approach the task.

To illustrate this with a fun example, you can ask an LLM to translate a sentence from German to English using only words that start with the letter "f."

For instance, when tasked with translating "Die Katze schläft gerne in der Box" (which means "The cat likes to sleep in the box"), the LLM came up with "Feline friend finds fluffy fortress," which is a surprisingly creative translation!

For more complex tasks, you might quickly find that zero-shot prompting requires very specific instructions, and even then, the performance is often less than ideal.

This is similar to how humans approach new tasks—if you're asked to do something unfamiliar, you'd likely request examples or demonstrations to better understand how to proceed. LLMs benefit from the same approach.

For instance, if you want a model to convert different currency amounts into a standardized format, you could provide a detailed description or offer a brief instruction along with some example demonstrations. The image above shows an example task.

Using this prompt, the model should easily handle the final example, "Steak: 24.99 USD," and respond with $24.99. Notice that the solution to the last example is deliberately omitted. Since LLMs are essentially text-completers, it's important to maintain a consistent structure in your prompt. You should essentially guide the model to provide exactly what you want, just as shown in the example.
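In code, such a few-shot prompt is nothing more than a carefully structured string. The examples below are my own; note how the prompt stops exactly where we want the model to continue.

```python
# A few-shot prompt: a short instruction plus worked examples, ending mid-pattern.
prompt = """Convert each amount into the format $X.XX.

Lunch: 12.5 US dollars -> $12.50
Taxi: 8 USD -> $8.00
Coffee: 3.75 dollars -> $3.75
Steak: 24.99 USD ->"""

# Sent to an LLM as-is, the most natural completion of this text is " $24.99".
print(prompt)
```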

To wrap up, a useful tip is to provide examples if the LLM struggles with a zero-shot task. This often helps the model better understand the task, improving both performance and reliability.

Another fascinating ability of LLMs mirrors human intelligence, especially when dealing with complex tasks that require multiple steps of reasoning.

For example, if I asked, "Who won the World Cup in the year before Lionel Messi was born?" what would you do? You'd likely break it down step by step, solving intermediate questions along the way to arrive at the correct answer. This is exactly what LLMs can do as well.

Research shows that instructing an LLM to "think step by step" can significantly improve its performance on many tasks.

Why does this work? The composite fact needed to answer the full question is unlikely to be stored directly in the model's internal memory. However, the individual facts it is built from, such as Messi's birth year and the winners of past World Cups, are very likely part of the model's training data.

Allowing the LLM to build up to the final answer helps because it gives the model a chance to "think out loud," using a kind of working memory to solve the smaller sub-problems before arriving at the correct solution.

The key here is understanding that everything in the input sequence before a word is generated provides context the model can use. So, as shown in the example above, by the time the model outputs "Argentina," it has already processed Messi's birthday and the relevant World Cup year, making it easier to provide the correct response.
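In practice, the trick is as simple as it sounds; the sketch below just builds the prompt. The exact wording is illustrative, and the step-by-step reasoning described in the comments is what a capable model would typically produce, not something computed here.

```python
# Chain-of-thought prompting: ask the model to spell out intermediate steps.
question = "Who won the World Cup in the year before Lionel Messi was born?"

direct_prompt = question
step_by_step_prompt = question + "\nLet's think step by step."

# With the second prompt, a capable model typically first recalls Messi's birth year
# (1987), then the World Cup held the year before (1986), and only then names the
# winner (Argentina). Each intermediate step becomes context it can build on.
print(step_by_step_prompt)
```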

Conclusion

As we wrap things up, I want to revisit a question posed earlier: Is the LLM truly just predicting the next word, or is there something more to it? Some researchers argue that it’s the latter, suggesting that to excel at next-word prediction in any context, the LLM must have developed an internal, compressed understanding of the world. Others contend that the model simply memorizes and replicates patterns from its training data, without any true comprehension of language, the world, or anything else.

At this stage, it’s hard to say which view is definitively correct; perhaps it’s just a matter of perspective. What’s clear is that LLMs are proving incredibly useful, demonstrating impressive knowledge and reasoning abilities, and maybe even hinting at sparks of general intelligence. Whether, and to what extent, this resembles human intelligence remains an open question, as does the potential for further breakthroughs in language modeling.

I hope this article has helped you better understand LLMs and the excitement surrounding them, so that you can form your own opinion on AI’s potential and risks. The future of AI shouldn't be determined solely by researchers and data scientists; everyone should have a voice in how it’s used to benefit the world. That’s why I aimed to write this article in an accessible way, with minimal background knowledge required.

If you’ve made it through this article, you now have a solid understanding of how state-of-the-art LLMs work (as of Fall 2023) at a high level.

I’ll leave you with some final thoughts on the current state of Artificial Intelligence and LLMs.


Press contact

Timon Harz

oneboardhq@outlook.com
