Timon Harz

December 17, 2024

How do Large Language Models work?

Decoding the Hidden Machinery of ChatGPT: How Artificial Intelligence Learns to Understand Language. Unravel the complex world of large language models and discover how machines transform billions of words into seemingly intelligent communication.

When ChatGPT launched in late 2022, it quickly made waves in the tech industry and among the broader public. While machine learning researchers had been exploring large language models (LLMs) for some time, most people were unaware of their true potential.

Today, LLMs are well-known, and millions have experimented with them. However, few truly understand how they function.

If you're familiar with LLMs, you’ve likely heard that they are trained to “predict the next word” and require vast amounts of text to do so. But that’s often where the explanation ends. The intricacies of how they make these predictions are still shrouded in mystery.

One reason for this is the unconventional way these models are created. Unlike traditional software, which is developed through explicit, step-by-step instructions from human programmers, ChatGPT is based on a neural network that learns by processing billions of words of regular language.

As a result, no one fully comprehends the inner workings of LLMs. While researchers are making strides, gaining a complete understanding will take years, possibly decades.

However, there is much that experts have uncovered about how these systems function. This article aims to break down the core concepts of LLMs and explain them in an accessible way—without relying on complex jargon or advanced math.

We'll begin by exploring word vectors, the unexpected method language models use to represent and process language. Next, we'll take a deep dive into the transformer, the foundational component behind systems like ChatGPT. Lastly, we'll cover how these models are trained and why achieving strong performance demands such vast amounts of data.

Word Vectors

To understand how language models function, it’s crucial to grasp how they represent words. While humans use sequences of letters—like C-A-T for "cat"—language models represent words with a series of numbers known as a "word vector." For instance, here's a sample vector representation of "cat":

[0.0074, 0.0030, -0.0105, 0.0742, 0.0765, -0.0011, 0.0265, 0.0106, 0.0191, 0.0038, -0.0468, -0.0212, 0.0091, 0.0030, -0.0563, -0.0396, -0.0998, -0.0796, …, 0.0002]

Why use such a complex notation? Think of it like geographical coordinates. Washington, DC, is located at 38.9 degrees north and 77 degrees west, which we can represent as a vector:

  • Washington, DC: [38.9, 77]

  • New York: [40.7, 74]

  • London: [51.5, 0.1]

  • Paris: [48.9, -2.4]

This system helps us reason about spatial relationships: New York is close to Washington, DC, because their coordinates are similar. Similarly, Paris is close to London but far from Washington, DC.

In the same way, language models treat word vectors as points in an abstract “word space.” Words with similar meanings are positioned closer together in this space. For example, the words closest to "cat" in vector space might include "dog," "kitten," and "pet."

The advantage of using vectors, composed of real numbers, over simple letter strings (like C-A-T) is that numbers allow for mathematical operations that letters cannot. Words are too intricate to be represented in just two dimensions, so language models use vector spaces with hundreds or even thousands of dimensions. While humans can’t easily visualize such high-dimensional spaces, computers excel at analyzing them and can generate meaningful relationships based on these complex representations.
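To make this concrete, here is a minimal sketch in Python using NumPy. The city coordinates come from the list above; the four-dimensional "word vectors" are invented purely for illustration, since real models use hundreds or thousands of dimensions:

    import numpy as np

    # City coordinate vectors from the list above.
    cities = {
        "Washington, DC": np.array([38.9, 77.0]),
        "New York":       np.array([40.7, 74.0]),
        "London":         np.array([51.5, 0.1]),
    }

    # The distance between two points tells us which cities are "close."
    def distance(a, b):
        return np.linalg.norm(a - b)

    print(distance(cities["Washington, DC"], cities["New York"]))  # small
    print(distance(cities["Washington, DC"], cities["London"]))    # large

    # Word vectors work the same way, just in many more dimensions.
    # These toy 4-dimensional vectors are invented for illustration.
    words = {
        "cat":    np.array([0.8, 0.1, 0.9, 0.3]),
        "kitten": np.array([0.7, 0.2, 0.8, 0.4]),
        "car":    np.array([0.1, 0.9, 0.2, 0.8]),
    }

    # Cosine similarity is a standard way to measure closeness in word space.
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(words["cat"], words["kitten"]))  # high: similar meanings
    print(cosine(words["cat"], words["car"]))     # lower: unrelated meanings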

Researchers have been exploring word vectors for decades, but the idea truly gained traction when Google introduced its word2vec project in 2013. Google analyzed millions of documents from Google News to identify patterns in which words tend to appear in similar contexts. Over time, a neural network trained to predict word pairings learned to position words with similar meanings (such as "dog" and "cat") close to one another in vector space.

What made Google’s word vectors particularly fascinating was their ability to enable reasoning through vector arithmetic. For instance, researchers at Google took the vector for "biggest," subtracted the vector for "big," and added the vector for "small." The word closest to the resulting vector? "Smallest." This capability revealed that word vectors could capture relationships between words in a way that was both mathematical and meaningful.

You can also use vector arithmetic to draw analogies! For example, just as "big" relates to "biggest," "small" relates to "smallest." Google’s word vectors captured many other types of relationships as well:

  • Swiss is to Switzerland as Cambodian is to Cambodia (nationalities)

  • Paris is to France as Berlin is to Germany (capitals)

  • Unethical is to ethical as possibly is to impossibly (opposites)

  • Mouse is to mice as dollar is to dollars (plurals)

  • Man is to woman as king is to queen (gender roles)
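In code, these analogies boil down to simple vector arithmetic followed by a nearest-neighbor search. The sketch below uses invented three-dimensional vectors so the idea fits in a few lines; with a real set of word2vec vectors, the same arithmetic recovers answers like "queen" and "smallest":

    import numpy as np

    # Invented low-dimensional vectors, for illustration only. Real word2vec
    # vectors have roughly 300 dimensions and are learned from billions of words.
    vecs = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.1, 0.8]),
        "man":   np.array([0.1, 0.9, 0.1]),
        "woman": np.array([0.1, 0.1, 0.9]),
    }

    def nearest(target, exclude):
        """Return the vocabulary word whose vector is closest to `target`."""
        best_word, best_score = None, -1.0
        for word, v in vecs.items():
            if word in exclude:
                continue
            score = target @ v / (np.linalg.norm(target) * np.linalg.norm(v))
            if score > best_score:
                best_word, best_score = word, score
        return best_word

    # man is to woman as king is to ...?
    result = vecs["king"] - vecs["man"] + vecs["woman"]
    print(nearest(result, exclude={"king", "man", "woman"}))  # queen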

Since these vectors are derived from how people actually use language, they inevitably reflect many of the biases present in human communication. For example, in some word vector models, "doctor minus man plus woman" leads to "nurse." Addressing such biases is an ongoing area of research.

Despite this, word vectors remain an essential component for language models because they capture subtle yet significant information about the relationships between words. If a language model learns something about a cat (such as it often visits the vet), it’s likely the same will apply to a kitten or a dog. Similarly, if a model understands the connection between Paris and France (such as sharing a language), it can infer the same for Berlin and Germany, or Rome and Italy.

Word Meaning Depends on Context

A basic word vector system like the one we’ve discussed doesn’t fully account for an important aspect of natural language: many words have multiple meanings.

Take the word "bank" as an example—it can either refer to a financial institution or to the land beside a river. Or consider these two sentences:

  • John picks up a magazine.

  • Susan works for a magazine.

While both sentences involve the word "magazine," the meanings differ subtly. John is referring to a physical magazine, whereas Susan is working for an organization that publishes such magazines.

Linguists call words with two unrelated meanings, like "bank," homonyms. Words with two closely related meanings, like "magazine," are described as polysemous (the phenomenon itself is called polysemy).

Large language models like ChatGPT can represent the same word with different vectors, depending on its context. So, there’s one vector for "bank" (financial institution) and another for "bank" (riverbank). Similarly, "magazine" will have different vectors for the physical publication and the organization that publishes it. As you might expect, words with related meanings (polysemy) are represented with more similar vectors than words with unrelated meanings (homonyms).

We haven’t yet explained how language models manage this—more on that shortly. But understanding these vector representations is essential to grasping how LLMs work.
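For readers who want to see this for themselves, here is a rough sketch using the Hugging Face transformers and torch packages with the small "gpt2" checkpoint. It assumes that " bank" is a single token in GPT-2's vocabulary (the tokenizer marks a leading space with "Ġ"), and the exact similarity numbers will vary; the point is only that the same word ends up with different vectors in different contexts:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2")

    def vector_for(sentence, word):
        """Return the final-layer hidden-state vector for `word` in `sentence`."""
        ids = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**ids).last_hidden_state[0]           # (num_tokens, 768)
        tokens = tok.convert_ids_to_tokens(ids["input_ids"][0].tolist())
        return hidden[tokens.index("Ġ" + word)]                  # vector for our word

    a = vector_for("He deposited money at the bank", "bank")
    b = vector_for("She sat on the river bank", "bank")
    c = vector_for("He opened an account at the bank", "bank")

    cos = torch.nn.functional.cosine_similarity
    print(cos(a, c, dim=0))  # two financial "bank"s: typically more similar
    print(cos(a, b, dim=0))  # financial vs. river "bank": typically less similar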

Traditional software operates on clear, unambiguous data. For example, if you ask a computer to compute "2 + 3," there's no confusion about what "2," "+," or "3" mean. But natural language is inherently ambiguous, with complexities that go beyond homonyms and polysemy. Here are a few examples:

  • In the sentence "The customer asked the mechanic to fix his car," does "his" refer to the customer or the mechanic?

  • In "The professor urged the student to do her homework," does "her" refer to the professor or the student?

  • In "Fruit flies like a banana," is "flies" a verb (suggesting fruit soaring through the air) or a noun (referring to insects that enjoy bananas)?

These ambiguities illustrate how language models must carefully account for context when interpreting words.

People resolve ambiguities like these by relying on context, but there are no simple, one-size-fits-all rules for doing so. Instead, it requires understanding various facts about the world. For instance, you need to know that mechanics usually fix customers' cars, that students generally complete their own homework, and that fruit typically doesn’t fly through the air.

Word vectors give language models the flexibility to represent the meaning of each word based on the context of a specific passage. Now, let’s explore how they use these vectors to make predictions.

Transforming Word Vectors into Word Predictions

GPT-3, a predecessor to the language models behind ChatGPT, is built from dozens of layers. Each layer takes in a sequence of vectors—one for every word in the input—and adds information that clarifies the meaning of each word, helping the model predict which word should come next.

Let’s examine how this works with a simplified example:

Each layer of a large language model (LLM) is based on a transformer, a neural network architecture first introduced by Google in a groundbreaking 2017 paper.

Suppose the model's input is the partial sentence "John wants his bank to cash the." These words are converted into word2vec-style vectors and fed into the first transformer layer.

The first transformer layer determines that "wants" and "cash" are functioning as verbs here, even though both words can also be nouns. The model doesn't record this added context as readable annotations; it stores it by altering the word vectors in ways that are difficult for humans to interpret. These adjusted vectors, known as the hidden state, are then passed on to the next transformer layer in the network.

The second transformer layer adds additional context: it determines that “bank” refers to a financial institution, not the side of a river, and that “his” refers to John. This second transformer produces a new set of hidden state vectors, now enriched with all the contextual information the model has gathered so far.

This is, of course, a highly simplified picture and shouldn't be taken as an exact depiction of how real models function. In reality, large language models (LLMs) like GPT-3 have far more than just two layers—GPT-3's most powerful version, for example, contains 96 layers.

Research shows that the first few layers of these models are primarily focused on understanding the sentence's structure and resolving ambiguities like those we've discussed. The later layers work to build a more comprehensive understanding of the entire passage.

For instance, as an LLM processes a short story, it seems to keep track of various details about the characters, such as their sex, age, relationships with other characters, locations, personalities, and goals. While researchers don’t yet fully understand how the model stores and organizes this information, it’s clear that it must do so by adjusting the hidden state vectors as they move through each layer.

A key factor in this process is the massive size of these vectors in modern LLMs. The most advanced version of GPT-3, for example, uses word vectors with 12,288 dimensions—each word is represented by a list of 12,288 numbers. This is 20 times larger than Google’s original word2vec vectors from 2013. These extra dimensions can be thought of as a kind of "scratch space," allowing the model to make notes to itself about the context of each word. As the vectors pass through each layer, the model can revise and refine its understanding of the passage.

Let’s imagine expanding our diagram to depict a 96-layer LLM interpreting a 1,000-word story. By the 60th layer, the vector for the word “John” might contain a note like: “(main character, male, married to Cheryl, cousin of Donald, from Minnesota, currently in Boise, trying to find his missing wallet).” All of these details would be encoded as a list of 12,288 numbers associated with the word "John," or possibly spread across vectors for other words like Cheryl, Donald, Boise, and wallet, depending on how the model organizes and processes the information.

The goal is for the 96th and final layer of the network to produce a hidden state for the last word that contains all the information needed to predict the next word.

Now, let’s dive into the inner workings of each transformer. Each transformer updates the hidden state for every word in the input passage through a two-step process:

  1. Attention Step: Words "look around" to find other words with relevant context and exchange information with them.

  2. Feed-Forward Step: Each word "reflects on" the information it received in the attention step and uses it to predict the next word.

While it’s the network itself that performs these steps, we phrase it this way to highlight that transformers treat words, not entire sentences or passages, as the basic unit of analysis. This approach allows large language models (LLMs) to fully leverage the parallel processing power of modern GPUs and scale to passages containing thousands of words—areas where earlier models struggled.

You can think of the attention mechanism as a matchmaking service for words. Each word creates a checklist (called a query vector) of the characteristics it's looking for, and it also makes a checklist (called a key vector) of its own characteristics. The network compares each key vector to each query vector (via a dot product) to find the best matches. Once a match is found, it transfers information from the word producing the key vector to the word producing the query vector.

For example, in the previous section, we showed a hypothetical transformer determining that, in the partial sentence "John wants his bank to cash the," "his" refers to John. Here's how this might work behind the scenes: The query vector for "his" might indicate, “I’m looking for: a noun describing a male person,” while the key vector for "John" might say, “I am: a noun describing a male person.” The network detects the match between these vectors and transfers information from "John" into the "his" vector.
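In mathematical terms, this matchmaking is a dot product between query and key vectors, followed by a softmax that turns the scores into weights, and then a weighted sum of value vectors (the information that actually gets copied). Here is a minimal single-head sketch in NumPy; the dimensions are toy values, the weight matrices are random placeholders for what a real model would learn, and real GPT-style models also mask the scores so each word can only attend to earlier words:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        """Scaled dot-product attention over a sequence of word vectors.

        Q: queries (seq_len, d) -- what each word is looking for
        K: keys    (seq_len, d) -- what each word offers
        V: values  (seq_len, d) -- the information that gets copied
        """
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # how well each query matches each key
        weights = softmax(scores)                # each row sums to 1
        return weights @ V                       # weighted mix of value vectors

    # Toy example: 3 words, 4-dimensional vectors.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 4))                               # hidden states for 3 words
    Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))  # learned in real models
    out = attention(x @ Wq, x @ Wk, x @ Wv)
    print(out.shape)                                          # (3, 4): one updated vector per word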

Each attention layer in a transformer model contains multiple “attention heads,” meaning the information-swapping process happens several times (in parallel) within each layer. Each attention head focuses on a different task:

  • One attention head might match pronouns to their corresponding nouns, as we discussed earlier.

  • Another attention head might handle the meaning of homonyms, like "bank."

  • A third attention head might link together two-word phrases, such as “Joe Biden.”

  • And so on.

Attention heads often work in sequence, with the results of one attention operation feeding into a subsequent layer’s attention head. For example, tasks like resolving homonyms or linking phrases could require multiple attention heads working together, rather than just one.

The largest version of GPT-3 consists of 96 layers, each with 96 attention heads, resulting in 9,216 attention operations every time it predicts a new word.
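The sketch below extends the previous one to several heads. Each head operates on its own slice of every word vector, and the slices are stitched back together afterward; the dimensions are again toy values rather than GPT-3's real ones:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def one_head(x):
        """A single attention head (queries = keys = values = x, for brevity)."""
        scores = x @ x.T / np.sqrt(x.shape[-1])
        return softmax(scores) @ x

    def multi_head(x, num_heads):
        """Each head works on its own slice of every word vector, in parallel."""
        seq_len, d_model = x.shape
        d_head = d_model // num_heads
        slices = [x[:, h * d_head:(h + 1) * d_head] for h in range(num_heads)]
        return np.concatenate([one_head(s) for s in slices], axis=-1)

    x = np.random.default_rng(1).normal(size=(5, 8))  # 5 words, 8-dimensional toy vectors
    print(multi_head(x, num_heads=4).shape)           # (5, 8)

    # GPT-3's largest version: 96 layers, 96 heads per layer.
    print(96 * 96)                                    # 9,216 attention operations per predicted word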

A Real-World Example

In the previous sections, we outlined a simplified view of how attention heads function. Now let’s examine how a real language model works. In a 2022 study, scientists at Redwood Research investigated how GPT-2, an early predecessor of the models behind ChatGPT, predicted the next word in the passage: “When Mary and John went to the store, John gave a drink to.”

GPT-2 predicted that the next word was “Mary.” The researchers discovered that three types of attention heads contributed to this prediction:

  • Three heads, named Name Mover Heads, copied information from the Mary vector to the final input vector (for the word “to”). GPT-2 used this vector to predict the next word.

  • How did the model decide that Mary was the right word to copy? The scientists traced this decision backward through GPT-2’s process and found four attention heads called Subject Inhibition Heads. These marked the second “John” vector, preventing the Name Mover Heads from copying it.

  • How did the Subject Inhibition Heads know to block John? Further investigation revealed two attention heads, Duplicate Token Heads, which marked the second “John” as a duplicate of the first. This helped the Subject Inhibition Heads decide that John shouldn’t be copied.

In short, these nine attention heads helped GPT-2 recognize that the phrase “John gave a drink to John” doesn’t make sense and chose “John gave a drink to Mary” instead.

This example highlights the challenge of fully understanding LLMs. The five-person Redwood team published a 25-page paper detailing how they identified and validated these attention heads. Despite this effort, we’re still far from a complete explanation for why GPT-2 predicted "Mary" as the next word.

For instance, how did the model determine the next word should be a name and not another type of word? Consider other scenarios where "Mary" wouldn’t be a fitting prediction. In a sentence like, “When Mary and John went to the restaurant, John gave his keys to,” the logical next word might be “the valet,” not a name.

With enough research, computer scientists may uncover more steps in GPT-2’s reasoning. Eventually, they might explain why Mary is the best prediction in this case. However, fully understanding the prediction of a single word could take months or even years.

The language models powering ChatGPT are much larger and more complex than GPT-2. These systems can handle more intricate reasoning tasks than the simple sentence-completion example studied by the Redwood team. As such, providing a complete explanation of how they work will be a massive undertaking, one that is unlikely to be fully accomplished anytime soon.

The Feed-Forward Step

Once the attention heads have exchanged information between word vectors, the feed-forward network steps in to “think about” each word vector and predict the next word. At this stage, no information is exchanged between words. The feed-forward layer processes each word in isolation, but it has access to any information copied by the attention heads earlier. Here’s how the feed-forward layer is structured in the largest version of GPT-3:

Each neuron in this network is a mathematical function that computes a weighted sum of its inputs and passes that sum through an activation function. If you're interested in a detailed explanation of how this works, check out our 2018 neural network explainer.

What makes the feed-forward layer powerful is its vast number of connections. A toy version of this network might have just three neurons in the output layer and six in the hidden layer, but GPT-3's feed-forward layers are much larger: the output layer contains 12,288 neurons, corresponding to the 12,288-dimensional word vectors, and the hidden layer has 49,152 neurons.

In the largest version of GPT-3, the hidden layer has 49,152 neurons, each with 12,288 inputs and corresponding weight parameters. The output layer has 12,288 neurons, each with 49,152 input values and weight parameters. This means each feed-forward layer contains 49,152 * 12,288 + 12,288 * 49,152 = 1.2 billion weight parameters. With 96 feed-forward layers, the total becomes 1.2 billion * 96 = 116 billion parameters, which make up almost two-thirds of GPT-3’s 175 billion parameters.
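Here is a quick sanity check of that arithmetic, plus a toy version of the layer itself. The big dimensions are GPT-3's published figures; the small matrices below are random stand-ins for learned weights, the biases are ignored (as they are in the rough count above), and a ReLU stands in for the GELU activation that real GPT models use:

    import numpy as np

    # Weight-parameter count for one GPT-3 feed-forward layer (biases ignored).
    d_model, d_ff = 12_288, 49_152
    per_layer = d_model * d_ff + d_ff * d_model
    print(f"{per_layer:,}")         # 1,207,959,552  (~1.2 billion)
    print(f"{per_layer * 96:,}")    # ~116 billion across all 96 layers

    # The layer itself, at toy scale: expand each word vector to the hidden
    # size, apply a nonlinearity, then project back down. Each word is
    # processed independently of the others.
    def feed_forward(x, W1, W2):
        hidden = np.maximum(0.0, x @ W1)   # ReLU for simplicity
        return hidden @ W2

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(8, 32))          # toy stand-ins for learned weights
    W2 = rng.normal(size=(32, 8))
    word_vector = rng.normal(size=8)
    print(feed_forward(word_vector, W1, W2).shape)   # (8,)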

In a 2020 study, researchers from Tel Aviv University discovered that feed-forward layers operate by pattern matching. Each neuron in the hidden layer identifies a specific pattern in the input text. Here are some patterns that neurons in a 16-layer version of GPT-2 matched:

  • A neuron in layer 1 matched sequences of words ending with “substitutes.”

  • A neuron in layer 6 matched sequences related to the military, particularly those ending with “base” or “bases.”

  • A neuron in layer 13 matched sequences ending with a time range, such as “between 3 pm and 7” or “from 7:00 pm Friday until.”

  • A neuron in layer 16 matched sequences related to television shows, like “the original NBC daytime version, archived” or “time shifting viewing added 57 percent to the episode’s.”

As you can observe, the patterns become more abstract in the later layers. The earlier layers focused on matching specific words, while the later layers identified broader semantic categories, such as television shows or time intervals.

This is intriguing because, as mentioned earlier, the feed-forward layer processes only one word at a time. So when it classifies the sequence “the original NBC daytime version, archived” as related to television, it only has access to the vector for the word "archived," not for words like "NBC" or "daytime." Presumably, the feed-forward layer recognizes that "archived" belongs to a television-related context because attention heads earlier in the process transferred relevant information into the "archived" vector.

When a neuron matches one of these patterns, it enhances the word vector with additional information. While this information is not always easy to interpret, in many cases, it can be seen as a tentative prediction about the next word.
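One way to picture this finding is to treat each hidden neuron as a key-value pair, an interpretation the Tel Aviv researchers proposed: the neuron's input weights act as a "key" that detects a pattern, and its output weights act as a "value" that gets added to the word's representation when the pattern fires. The numbers below are invented, and the real network folds this into large matrix multiplications, but the sketch captures the idea:

    import numpy as np

    # One hidden neuron, read as a pattern matcher (all numbers invented).
    key   = np.array([0.9, -0.2, 0.4, 0.0])   # input weights: the pattern to detect
    value = np.array([0.0,  0.5, 0.0, 0.7])   # output weights: the nudge to apply

    def neuron_update(word_vector):
        activation = max(0.0, word_vector @ key)   # how strongly the pattern matched
        return word_vector + activation * value    # enrich the vector if it matched

    matching     = np.array([1.0, 0.0, 0.5, 0.0])  # resembles the pattern
    non_matching = np.array([-1.0, 0.5, 0.0, 0.0]) # does not

    print(neuron_update(matching))      # gets a nudge in the direction of `value`
    print(neuron_update(non_matching))  # unchanged: the activation is zero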

Feed-Forward Networks and Vector Math

Recent research from Brown University provided an elegant example of how feed-forward layers assist in predicting the next word. As we discussed earlier, Google’s word2vec research demonstrated that vector arithmetic could be used to reason by analogy—such as Berlin - Germany + France = Paris.

The Brown researchers discovered that feed-forward layers sometimes use this exact method for word prediction. For instance, they examined how GPT-2 responded to the prompt: “Q: What is the capital of France? A: Paris Q: What is the capital of Poland? A:”

The researchers analyzed a 24-layer version of GPT-2. After each layer, they checked the model's top prediction for the next token. In the first 15 layers, the model's guess was seemingly random. Between the 16th and 19th layers, the model began predicting "Poland"—not the right answer, but getting closer. Then, by the 20th layer, the top guess shifted to "Warsaw"—the correct answer—and remained consistent through the final layers.
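Inspecting a model layer by layer like this is often called the "logit lens": take the hidden state after each layer and run it through the model's final output step as if that layer were the last one. The sketch below shows roughly how one could do this with the Hugging Face transformers and torch packages and the small 12-layer "gpt2" checkpoint; it is an approximation of the procedure described above, not the Brown team's exact code, and the smaller model will not necessarily land on "Warsaw":

    import torch
    from transformers import AutoTokenizer, GPT2LMHeadModel

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    prompt = ("Q: What is the capital of France? A: Paris "
              "Q: What is the capital of Poland? A:")
    ids = tok(prompt, return_tensors="pt")

    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)

    for layer, hidden in enumerate(out.hidden_states):
        # Pretend this layer is the last one: apply the final layer norm and
        # the output matrix to the hidden state at the final position.
        last = model.transformer.ln_f(hidden[0, -1])
        logits = model.lm_head(last)
        print(layer, tok.decode([int(logits.argmax())]))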

The Brown researchers discovered that the 20th feed-forward layer transformed "Poland" into "Warsaw" by adding a vector that maps country vectors to their corresponding capitals. Adding the same vector to "China" resulted in "Beijing."

In addition, feed-forward layers in the same model employed vector arithmetic to convert lowercase words to uppercase and present-tense verbs to their past-tense forms.

Different Roles of Attention and Feed-Forward Layers

We’ve now explored two real-world examples of GPT-2's word prediction: attention heads helping predict that "John gave a drink to Mary" and a feed-forward layer helping predict that "Warsaw" is the capital of Poland.

In the first example, "Mary" was provided in the user’s prompt. However, in the second case, "Warsaw" wasn’t part of the prompt. Instead, GPT-2 had to "remember" that Warsaw is the capital of Poland—an item of information it learned from training data.

When the Brown researchers disabled the feed-forward layer responsible for converting "Poland" to "Warsaw," the model stopped predicting Warsaw as the next word. Interestingly, though, when they added the sentence "The capital of Poland is Warsaw" to the beginning of the prompt, GPT-2 could once again correctly answer the question. This suggests that GPT-2 used attention heads to copy the name "Warsaw" from earlier in the prompt.

This separation of responsibilities is more broadly true: Attention heads retrieve information from previous words in a prompt, while feed-forward layers help the model "remember" information that wasn’t included in the prompt.

In fact, one way to think about feed-forward layers is as a database containing knowledge the model has learned during training. The earlier feed-forward layers tend to store simple facts about specific words, like "Trump often follows Donald." Later layers capture more complex relationships, such as "add this vector to transform a country into its capital."

How Language Models Are Trained

Early machine learning algorithms often required human-labeled training examples. For instance, photos of dogs and cats needed to be tagged with labels like "dog" or "cat." The reliance on humans for labeling data made it challenging and costly to create large datasets needed to train powerful models.

A major innovation of large language models (LLMs) is that they don’t need explicitly labeled data. Instead, they learn by predicting the next word in regular text. Almost any form of written material—whether it’s Wikipedia, news articles, or computer code—can be used to train these models.

For example, an LLM might be given the phrase “I like my coffee with cream and” and predict “sugar” as the next word. A freshly initialized language model would perform poorly at this task, as each of its 175 billion weight parameters (in the case of GPT-3’s largest version) would start with essentially random values.

However, as the model is exposed to hundreds of billions of words, its weights gradually adjust to make better predictions over time.
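Concretely, a "training example" is nothing more than a stretch of ordinary text in which the target at every position is simply the next token. The sketch below splits on spaces for readability, whereas real models operate on sub-word tokens, but the principle is the same:

    # Turn a stretch of ordinary text into (context, target) pairs. The target
    # at every position is simply the next word, so no human labeling is needed.
    text = "I like my coffee with cream and sugar".split()

    inputs  = text[:-1]   # everything except the last word
    targets = text[1:]    # the same sequence shifted one step ahead

    for n in range(1, len(inputs) + 1):
        context = " ".join(inputs[:n])
        print(f"{context!r:45} -> predict {targets[n - 1]!r}")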

To explain this more clearly, let’s use an analogy. Imagine you're taking a shower and want to set the water temperature just right—not too hot, not too cold. Since you’ve never used this faucet before, you randomly turn the knob and test the water. If it’s too hot, you adjust the knob one way; if it’s too cold, you turn it in the opposite direction. The closer you get to the ideal temperature, the smaller the adjustments you make.

Now, let’s tweak the analogy. Instead of one faucet, there are 50,257 faucets, one for every token in the model’s vocabulary (words and word fragments like "the," "cat," or "bank"). Your goal is to adjust the faucets so that water flows only from the one corresponding to the next word in the sequence.

The Training Process: A Maze of Pipes and Valves

In this analogy, imagine that behind each faucet is a maze of interconnected pipes, each equipped with numerous valves. If the wrong faucet dispenses water, you don't simply adjust the knob. Instead, you send an army of intelligent squirrels to trace each pipe back, adjusting the valves they encounter along the way.

This becomes tricky because one pipe often feeds into multiple faucets. So, the squirrels need to think carefully about which valves to tighten, which to loosen, and by how much.

While this analogy may sound outlandish when taken literally—who would design a system with 175 billion valves?—it captures what training a language model actually involves, and thanks to advances in computing power, the process really is carried out at this enormous scale.

In large language models (LLMs), the components we've discussed so far—the neurons in the feed-forward layers and the attention heads that transfer context—are simply chains of mathematical functions (mostly matrix multiplications) with weight parameters that can be adjusted. Just as the squirrels adjust the valves to control water flow, the training algorithm tweaks these weight parameters to manage the flow of information through the neural network.

Training occurs in two phases. First, there’s the "forward pass," where you turn on the water and check if it flows from the correct faucet. Next, during the "backward pass," the squirrels (represented by an algorithm called backpropagation) race through the network to adjust the valves. Backpropagation "walks backward" through the network, using calculus to determine how much each weight needs to be adjusted. (If you want more details about backpropagation, check out our 2018 explainer on neural networks.)
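Here is what those two phases look like in code for a tiny stand-in model, using PyTorch's autograd machinery to play the role of the squirrels. The model, vocabulary size, and data are toy placeholders, not an LLM:

    import torch
    import torch.nn as nn

    vocab_size, d_model = 100, 16
    model = nn.Sequential(                         # a tiny stand-in "network of pipes"
        nn.Embedding(vocab_size, d_model),
        nn.Linear(d_model, vocab_size),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    tokens = torch.randint(0, vocab_size, (1, 8))  # a fake 8-token training snippet
    inputs, targets = tokens[:, :-1], tokens[:, 1:]

    # Forward pass: "turn on the water" and see which faucets flow.
    logits = model(inputs)                         # (1, 7, vocab_size)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1))

    # Backward pass: backpropagation works out how much to turn each valve...
    loss.backward()
    # ...and the optimizer actually turns them, nudging every weight slightly.
    optimizer.step()
    optimizer.zero_grad()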

The Training Process and GPT-3's Surprising Performance

Completing a forward pass with a single example, followed by a backward pass to adjust the network’s performance, involves hundreds of billions of mathematical operations. Training a massive model like GPT-3 requires repeating this process across countless examples. OpenAI estimates that training GPT-3 involved more than 300 billion trillion floating-point calculations—equivalent to months of work for numerous high-performance computer chips.
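That figure lines up with a common rule of thumb from the scaling-law literature: training compute is roughly six floating-point operations per parameter per training token. Plugging in GPT-3's 175 billion parameters and the roughly 300 billion tokens its training run processed (a back-of-the-envelope check, not an official calculation):

    params = 175e9            # GPT-3 weight parameters
    tokens = 300e9            # tokens processed during training (approximate)
    flops = 6 * params * tokens
    print(f"{flops:.2e}")     # ~3.15e+23, i.e. a bit over 300 billion trillion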

Why GPT-3 Performs So Well

The effectiveness of the training process may seem surprising, given how simple the learning mechanism is. ChatGPT can handle a wide range of tasks—writing essays, making analogies, and even writing computer code. So how does such a straightforward approach produce such a powerful model?

One key factor is scale. The sheer number of examples GPT-3 is exposed to is staggering. It was trained on a dataset of approximately 500 billion words, whereas a typical human child encounters around 100 million words by age 10.

Over the past several years, OpenAI has steadily expanded the size of its language models. In a well-known 2020 paper, OpenAI reported that the performance of its models improved as a power-law function of model size, dataset size, and the amount of computing power used for training—trends spanning more than seven orders of magnitude.

The larger the models became, the better they performed on language-related tasks, but this improvement only occurred when the amount of training data also grew significantly. To train larger models on even more data, vast amounts of computing power were required.

OpenAI’s first large language model, GPT-1, was released in 2018. It used 768-dimensional word vectors, had 12 layers, and contained a total of 117 million parameters. A few months later, GPT-2 was launched, with its largest version featuring 1,600-dimensional word vectors, 48 layers, and a total of 1.5 billion parameters.

In 2020, OpenAI released GPT-3, featuring 12,288-dimensional word vectors and 96 layers, totaling 175 billion parameters. In 2023, OpenAI introduced GPT-4; the company has not disclosed specific architectural details, but it is widely believed that GPT-4 is significantly larger than GPT-3. Each model not only learned more facts than its predecessors, but also showed improved performance on tasks requiring abstract reasoning.

Take, for example, the following story:

There is a bag filled with popcorn. The bag's label says "chocolate," but it doesn't contain any chocolate—only popcorn. Sam finds the bag. She has never seen it before and cannot see inside. She reads the label.

You can probably guess that Sam will believe the bag contains chocolate and will be surprised to find popcorn inside. This ability to reason about the mental states of others is called "theory of mind." Most people develop this skill in early childhood, and while experts debate whether non-human animals, like chimpanzees, possess theory of mind, it's widely recognized as crucial for human social interaction.

In a 2023 study, Stanford psychologist Michal Kosinski examined LLMs’ ability to tackle theory-of-mind tasks. He presented them with passages similar to the one above and asked them to complete sentences like, "She believes that the bag is full of..." The correct answer is "chocolate," but a less capable language model might say "popcorn" or something else.

GPT-1 and GPT-2 failed this test. However, the initial version of GPT-3, released in 2020, correctly answered almost 40 percent of the time, a performance level comparable to that of a 3-year-old. A newer version of GPT-3, released in November 2022, improved this to approximately 90 percent, matching the abilities of a 7-year-old. GPT-4 performed even better, correctly answering about 95 percent of theory-of-mind questions.

"Since there is no indication that theory-of-mind (ToM)-like abilities were deliberately designed into these models, nor research showing how to achieve that, it’s likely that ToM-like abilities emerged spontaneously and autonomously as a byproduct of the models’ increasing language capabilities," Kosinski wrote.

It’s important to note that not all researchers agree that these results demonstrate true theory of mind. For example, small adjustments to the false-belief task led to much poorer performance from GPT-3, and its performance on other theory-of-mind tasks has been inconsistent. As the cognitive scientist Sean Trott has suggested, successful performance may be due to confounding factors in the task—something akin to a "Clever Hans" effect, where the model appears to perform well due to unintended patterns rather than genuine understanding.

Nonetheless, GPT-3’s near-human performance on several theory-of-mind tasks is remarkable and would have seemed impossible just a few years ago. It also supports the idea that larger models tend to perform better on tasks requiring high-level reasoning.

This is just one of many examples of language models seemingly developing high-level reasoning capabilities on their own. In 2023, Microsoft researchers published a paper arguing that GPT-4 displayed early signs of artificial general intelligence—an ability to think in ways that are sophisticated and human-like.

For instance, one researcher asked GPT-4 to draw a unicorn using the obscure graphics programming language TikZ. GPT-4 generated a few lines of code, which the researcher then compiled with the TikZ software. While the resulting images were rough, they clearly indicated that GPT-4 understood what a unicorn should look like.

The researchers initially considered the possibility that GPT-4 had simply memorized code for drawing a unicorn from its training data. To test this, they presented a follow-up challenge: they modified the unicorn code by removing the horn and adjusting some of the body parts. They then asked GPT-4 to place the horn back in the correct position. GPT-4 responded by accurately positioning the horn.

GPT-4 demonstrated an impressive ability to reason about spatial concepts despite being trained exclusively on text-based data. Without any image-based training inputs, the model somehow developed an understanding of a unicorn's physical form through its extensive textual training.

This capability highlights a profound mystery in artificial intelligence. Researchers are divided on the interpretation of such achievements. One perspective suggests that large language models are developing genuine conceptual understanding, piecing together complex mental representations from textual information. Conversely, another school of thought argues that these models are essentially sophisticated pattern-matching systems—advanced "stochastic parrots" that generate intricate word sequences without true comprehension, merely mimicking understanding through statistical correlations.

The fundamental question remains: Are these AI systems truly grasping meaning, or are they performing an extraordinarily complex form of linguistic pattern recognition?

This philosophical debate surrounding artificial intelligence's understanding of language may ultimately prove irresolvable. However, the focus should remain on empirical outcomes. When a language model consistently provides accurate answers to specific questions—and researchers have rigorously eliminated potential training biases—the result is scientifically significant, regardless of whether the model's comprehension precisely mirrors human understanding.

The effectiveness of next-token prediction training may stem from the inherent predictability of language itself. Language contains regularities that often reflect underlying patterns in the physical world. By learning word relationships, AI models simultaneously gain insights into broader relational structures beyond mere linguistic constructs.

Moreover, prediction might be a fundamental mechanism of intelligence across biological and artificial systems. Philosophers like Andy Clark propose that brains function essentially as "prediction machines" designed to anticipate environmental dynamics and enable successful navigation. This perspective suggests that accurate predictive capabilities require robust, nuanced representations—much like how a precise map enables more effective journey planning. In a complex, unpredictable world, the ability to generate reliable predictions provides a crucial adaptive advantage for organisms and intelligent systems alike.

Historically, developing language models faced a significant challenge: determining the most effective method of representing word meanings. This complexity is compounded by the fact that many words derive their significance from contextual nuances. The next-word prediction approach elegantly circumvents this theoretical quandary by transforming it into an empirical challenge.

By leveraging massive datasets and substantial computational resources, researchers discovered that language models can organically uncover intricate linguistic patterns. These models learn the subtleties of human communication not through explicit programming, but by becoming increasingly adept at predicting subsequent words. The process reveals language's underlying structures almost as a by-product of predictive optimization.

However, this breakthrough comes with a notable caveat: the resulting systems operate as sophisticated "black boxes," with their internal mechanisms remaining largely opaque. While the models demonstrate remarkable linguistic capabilities, their precise decision-making processes remain a mystery, highlighting the tension between impressive performance and true comprehension.
