Timon Harz

December 18, 2024

What is instruction tuning?

Instruction tuning improves the ability of LLMs to follow complex instructions, making them more reliable for real-world tasks. By fine-tuning with instructional datasets, models become better equipped to handle diverse user requests and improve accuracy across tasks.

Instruction tuning is a process used to refine large language models (LLMs) by training them on a dataset of labeled instructional prompts paired with their corresponding outputs. This technique enhances the model’s ability to follow instructions more effectively, improving its performance not just for specific tasks, but in interpreting and executing instructions in general, making pre-trained models more suitable for practical applications.

It is a type of fine-tuning, a broader approach used to adapt pre-trained foundation models to various downstream tasks. Fine-tuning allows these models to be customized for different purposes, such as adjusting style, expanding knowledge, improving task performance, or optimizing for a particular use case. Though fine-tuning can be applied across various AI domains and model architectures, it has become a key step in the LLM lifecycle. For instance, Meta’s Llama 2 models are available in different versions, including a base model, a variant fine-tuned for dialogue (Llama-2-chat), and one fine-tuned for coding (Code Llama).

Instruction tuning can coexist with other fine-tuning methods. For example, chat models often undergo both instruction tuning and reinforcement learning from human feedback (RLHF), which helps improve qualities like helpfulness and honesty. Similarly, models fine-tuned for coding may receive both instruction tuning to improve general instruction-following and additional fine-tuning on programming data to enhance the model’s understanding of coding syntax and terminology.

Instruction tuning is crucial because, like most fine-tuning methods, it addresses the fact that pre-trained large language models (LLMs) are not optimized for following instructions or engaging in conversations. In a basic sense, LLMs don't “answer” prompts as humans would—they generate text based on the input. Instruction tuning improves this by making the generated text more relevant and useful.

The pre-training of autoregressive language models, such as Meta's Llama 2, OpenAI's GPT, Google’s Gemini, or IBM's Granite, primarily optimizes the models to predict the next word(s) in a sequence until the sentence or passage is complete. During pre-training, the model is fed the beginning of a text sample and tasked with predicting the word that comes next. Its prediction is compared to the actual next word in the dataset, and the model’s parameters are adjusted through optimization techniques like gradient descent to improve accuracy over time. This process helps the model learn linguistic patterns and, in turn, the knowledge conveyed within those patterns.
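As a rough illustration of that objective, the sketch below computes the next-token prediction loss for a single sequence. The token ids are invented and random logits stand in for a real model’s output; only the shift-and-cross-entropy pattern matters here.

```python
import torch
import torch.nn.functional as F

# Toy next-token prediction objective: position t is scored on how well it
# predicts token t+1. Token ids and logits are made up for illustration.
vocab_size = 50_000
token_ids = torch.tensor([[464, 3290, 4539, 319, 262, 8701]])  # a short tokenized sentence
logits = torch.randn(1, token_ids.shape[1], vocab_size, requires_grad=True)

shift_logits = logits[:, :-1, :]   # predictions for positions 0..n-2
shift_labels = token_ids[:, 1:]    # the actual next tokens, positions 1..n-1

loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
loss.backward()  # gradients used for a gradient-descent update
print(loss.item())
```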

While this approach enables LLMs to generate coherent and grammatically correct text, it doesn't necessarily make them good at understanding and following specific user instructions. Without further fine-tuning, a base model might generate a grammatically correct but irrelevant response, such as answering "teach me how to bake bread" with "in a home oven," which is not the intended outcome.

Training an LLM specifically for instruction-following from scratch is impractical due to the enormous computational, time, and data requirements involved, given the billions of parameters these models typically have. Instead, fine-tuning an already pre-trained model is a much more efficient approach, requiring far less data and computational power—especially when using techniques like parameter-efficient fine-tuning (PEFT), which minimizes computational demand.
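To make the PEFT idea concrete, here is a minimal LoRA sketch assuming the Hugging Face transformers and peft libraries. The model identifier is only a placeholder (Llama 2 checkpoints are gated and require access approval), and the target module names are specific to Llama-style architectures.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a pre-trained causal LM (placeholder identifier; any causal LM works).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA freezes the base weights and trains small low-rank adapter matrices
# injected into selected layers, drastically reducing trainable parameters.
lora_config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # scaling factor applied to the adapters
    target_modules=["q_proj", "v_proj"],  # attention projections (Llama naming)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```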

Instruction tuning involves supervised learning with labeled (input, output) pairs. What sets it apart from other supervised fine-tuning methods is that the input samples in the dataset consist of tasks resembling real user prompts, and the outputs provide the desired responses. By adjusting the model’s parameters to match the examples in the dataset, the model learns to produce specific, actionable responses to instructions, such as offering genuine advice on how to bake bread.
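One common way to turn such (instruction, output) pairs into training examples is to render them with a prompt template and compute the loss only on the response tokens. The sketch below assumes the Hugging Face tokenizer API; the template and tokenizer choice are illustrative, not a fixed standard.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for illustration

instruction = "Teach me how to bake bread."
output = "Mix flour, water, salt and yeast, knead the dough, let it rise, then bake at 230 C."

# Render the pair with a simple prompt template (templates vary between datasets).
prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
output_ids = tokenizer(output, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + output_ids
# Mask the prompt positions with -100 so the loss (and thus the parameter updates)
# only reflects how well the model reproduces the desired response.
labels = [-100] * len(prompt_ids) + output_ids
```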

Ultimately, instruction tuning bridges the gap between a model’s core function of next-word prediction and the user’s goal of having the model perform specific tasks or follow instructions. This makes the model’s behavior more reliable, useful, and aligned with user expectations.

Instruction tuning enhances large language models (LLMs) by fine-tuning them on a labeled dataset consisting of a variety of instruction-following tasks, which improves their ability to adhere to user instructions across a broad range of tasks. This process reduces the need for detailed in-context information, making it easier to obtain effective responses from the model with less input. Instruction datasets can either be manually created or generated by other LLMs, providing a flexible approach to creating relevant training data.

According to a 2022 Google Research paper, "Finetuned Language Models are Zero-Shot Learners," the aim of instruction tuning is to improve LLMs' ability to respond to natural language processing (NLP) instructions. This is accomplished by combining elements of pretraining-finetuning and prompt engineering. Essentially, instruction tuning integrates prompt engineering techniques into the fine-tuning process, which reduces the amount of manual prompting required to obtain useful, accurate responses from the model.

Each example in an instruction dataset contains three key components (a complete example is sketched after the list):

  1. An instruction: A natural language prompt specifying the task at hand, such as "translate this sentence from English to Spanish."

  2. Additional information: Supplementary context relevant to the task, such as a passage for a reading comprehension task.

  3. Desired output: The model’s target response, which serves as a ground truth for training and evaluating the model's accuracy.
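Put together, a single hypothetical instance in an Alpaca-style format might look like the following; field names vary between datasets.

```python
# One hypothetical training instance combining the three components above.
example = {
    "instruction": "Translate this sentence from English to Spanish.",
    "input": "The dog runs on the grass.",        # additional information / context
    "output": "El perro corre sobre la hierba.",  # desired, ground-truth response
}
```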

In Google’s study, researchers instruction-tuned the LaMDA-PT model to produce FLAN (Finetuned Language Net) and found that instruction tuning led to significant improvements in tasks like translation, question-answering, reading comprehension, and natural language inference (NLI). The paper also referenced the observation made by the creators of GPT-3 that pre-trained models struggle with tasks like NLI because such tasks rarely appear in their pre-training data. Instruction tuning, however, helped improve performance on these tasks by providing training examples that resemble real user requests.

An interesting finding from the FLAN study was that adding a variety of tasks to the instruction dataset improved the model’s performance not only on tasks it had seen during training but also on novel tasks that weren’t part of the instruction set. This is the key advantage of instruction tuning: it enhances the model’s ability to follow instructions in general, making it more adaptable and versatile.

The study also compared instruction tuning with multi-task fine-tuning by testing three different fine-tuning setups:

  1. No template: Only input-output pairs were provided, without any instruction.

  2. Dataset name: Each input was labeled with the task and dataset, such as “[Translation: WMT 14 to French].”

  3. FLAN instructions: The input followed instruction tuning principles, such as “Please translate this sentence to French: 'The dog runs.'”

The results showed that instruction-tuned models outperformed those fine-tuned with only input-output pairs or dataset names, achieving greater accuracy on zero-shot tasks. This demonstrated that instruction-specific fine-tuning is more effective at improving performance on unseen tasks.

Instruction tuning also benefits tasks that require reasoning, such as chain-of-thought (CoT) tasks. CoT prompting encourages LLMs to break down problems step by step, which improves their performance on complex reasoning tasks. Research by Wei and others found that instruction tuning without CoT tasks degrades model performance in these areas. However, adding CoT tasks to the instruction dataset boosts the model's reasoning capabilities, making it more effective at logical and symbolic tasks.
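The difference mainly shows up in the target output: a CoT-style training instance spells out the intermediate reasoning before the final answer. A hypothetical example, using the standard arithmetic problem from the CoT literature:

```python
# Hypothetical chain-of-thought training instance: the target output walks
# through intermediate steps instead of stating only the final answer.
cot_example = {
    "instruction": (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls, "
        "each with 3 balls. How many tennis balls does he have now?"
    ),
    "output": (
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
        "5 + 6 = 11. The answer is 11."
    ),
}
```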

Several open-source datasets are available for instruction tuning LLMs. These datasets include naturally occurring instruction-output pairs, templates for converting existing annotated data into instruction-based formats, and examples generated by other LLMs, offering various ways to train models on specific tasks.

Human-Created Datasets

Creating (instruction, output) pairs by hand is conceptually straightforward but expensive, requiring significant time and financial investment. To alleviate this, several techniques have been developed to convert natural language datasets into instructions, typically by using templates. The availability of various open-source, human-crafted datasets has helped reduce the cost of fine-tuning models on organic data.

Notable open-source human-generated instruction datasets include:

  • Flan: Originally used to fine-tune Google’s LaMDA-PT model, which produced the first FLAN model, the dataset has since been refined and applied to fine-tune numerous LLMs, including FLAN-T5, Flan-UL2, and Flan-PaLM 540B.

  • OpenAssistant: OpenAssistant Conversations is a multilingual, human-created dialogue corpus designed for assistant-style exchanges. It includes 91,829 user prompts and 69,614 assistant replies, drawn from 66,497 conversation trees in 35 languages.

  • Dolly: Dolly is a dataset of 15,000 human-generated prompt-and-response records, designed to enable LLMs to interact with users in a dialogue-driven style similar to ChatGPT. It covers a wide range of tasks and behaviors, including summarization, information extraction, brainstorming, creative writing, classification, and question-answering. (A loading sketch follows this list.)
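For a sense of what these records look like in practice, the sketch below loads Dolly with the Hugging Face datasets library; the Hub identifier and field names are taken from the public databricks/databricks-dolly-15k release and may change.

```python
from datasets import load_dataset

# Load the Dolly instruction dataset from the Hugging Face Hub.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

record = dolly[0]
print(record["instruction"])  # the user request
print(record["context"])      # optional supporting passage (often empty)
print(record["response"])     # the human-written answer
print(record["category"])     # task type, e.g. "summarization" or "brainstorming"
```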

LLM-Generated Datasets

Given the high cost and effort involved in manually creating instruction datasets, many datasets now rely on larger LLMs to generate prompts, outputs, or both. This approach often benefits smaller models by teaching them to emulate the behavior of larger models in a teacher/learner dynamic.

  • Self-Instruct: Created using InstructGPT (an instruction-tuned version of GPT-3), this dataset was generated by providing InstructGPT with natural language "seed tasks" to create additional examples, resulting in 52,000 training instructions; a simplified sketch of this bootstrapping loop follows the list. This methodology was adapted by Stanford researchers to generate training data for Alpaca, LLaMA's first instruction-tuned variant, which outperformed InstructGPT on benchmarks.

  • Evol-Instruct: An extension of the Self-Instruct method, Evol-Instruct uses two strategies: evolution (increasing instruction complexity by adding constraints, steps, and input variety) and mutation (changing instructions to broaden topic coverage). This approach was used to fine-tune LLaMA for the WizardLM model.

  • ShareGPT: ShareGPT.com hosts user-generated conversations with ChatGPT, providing a large dataset for training multi-turn dialogues. Vicuna, a fine-tuned version of LLaMA, utilized 70,000 conversations from ShareGPT for its training.

  • OpenOrca: OpenOrca augments the Flan dataset with completions from large GPT models, reproducing the data used to train Microsoft’s Orca. It focuses on using larger models to improve smaller LLMs through imitation learning.
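As a rough sketch of the Self-Instruct idea mentioned above, the function below grows an instruction pool from a handful of seed tasks. The generate_with_llm callable is a hypothetical stand-in for a call to a teacher model, and the de-duplication here is far cruder than the filtering used in the actual method.

```python
import random

def self_instruct(seed_tasks, generate_with_llm, target_size=100):
    """Grow a pool of instructions from a few human-written seed tasks.

    `generate_with_llm` is a hypothetical callable that sends a prompt to a
    teacher LLM and returns its text completion.
    """
    task_pool = list(seed_tasks)
    while len(task_pool) < target_size:
        # Show the teacher a few in-context examples and ask for a new task.
        examples = random.sample(task_pool, k=min(3, len(task_pool)))
        prompt = "Come up with a new task:\n" + "\n".join(f"- {t}" for t in examples)
        new_task = generate_with_llm(prompt).strip()

        # Crude de-duplication; the real pipeline filters far more aggressively.
        if new_task and new_task not in task_pool:
            task_pool.append(new_task)
    return task_pool

seeds = [
    "Translate this sentence from English to Spanish.",
    "Summarize the following paragraph in one sentence.",
    "Write a short poem about autumn.",
]
# tasks = self_instruct(seeds, generate_with_llm=my_teacher_model_call)
# Each generated task would then be paired with a model-written output to form
# (instruction, output) training pairs.
```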

As LLM capabilities increase, the effectiveness of LLM-generated datasets also grows. A 2023 paper replicated Alpaca’s fine-tuning process by using GPT-4 to generate instruction data, resulting in a model (LLaMA-GPT4) that significantly outperformed Alpaca on “Helpfulness” scores and closely matched GPT-4 in “Helpfulness,” “Honesty,” and “Harmlessness.”

Challenges and Limitations of Instruction Tuning

Despite significant advances in LLM performance due to instruction tuning, challenges remain, particularly in building sufficiently diverse instruction datasets and in understanding where the benefits actually come from.

A primary challenge in instruction tuning is the creation of high-quality instructions for fine-tuning. The resources required to develop large instruction datasets have concentrated instruction tuning around a handful of open-source datasets, which may limit model diversity. Although larger proprietary LLMs help reduce the cost of instruction generation, they risk propagating the biases and limitations of those models across open-source LLMs. This issue is further complicated when proprietary models are also used to evaluate the performance of smaller models.

Some researchers argue that fine-tuning smaller models with larger ones may result in the smaller models mimicking the larger models' style rather than improving functionality. A 2023 study indicated that performance improvements from instruction tuning may stem more from recognizing superficial patterns than from genuine advances in logical reasoning. Moreover, some improvements may only be applicable to tasks closely related to the instruction dataset. Researchers like Gudibande suggest that a better approach to improving open-source models would involve addressing the challenge of developing stronger base language models, rather than relying on imitation of proprietary systems.


Press contact

Timon Harz

oneboardhq@outlook.com
