Timon Harz
December 12, 2024
Run Llama 3.1 on Device with Core ML for Faster AI Processing
Learn how to enhance Llama model performance on Apple devices with cutting-edge optimizations. This post covers the step-by-step process for deploying efficient, privacy-conscious LLMs on macOS using Core ML.

Introduction to Local LLM Deployment on Apple Silicon
Many app developers are keen on creating on-device experiences that integrate increasingly powerful large language models (LLMs). Running these models locally on Apple silicon enables developers to harness the capabilities of the user’s device for cost-effective inference, without sending data to third-party servers, which also helps preserve user privacy. However, these models need to be carefully optimized to make the most of the available system resources, as LLMs tend to have high memory and processing power demands.
Optimizing and Deploying an LLM on Apple Silicon
This technical post explains how to optimize and deploy an LLM to Apple silicon to meet the performance requirements for real-time applications. For this example, we use Llama-3.1-8B-Instruct, a popular mid-sized LLM, and demonstrate how Apple's Core ML framework, combined with the optimizations described here, can achieve a decoding speed of about 33 tokens/s on a Mac with M1 Max. Although this post focuses on the Llama model, the principles apply to other transformer-based LLMs of various sizes.
Steps to Convert and Optimize the Model
We begin by using the official definition and trained weights of the Llama-3.1-8B-Instruct model hosted on Hugging Face. The post outlines the process of converting the model into the Core ML format using Core ML Tools, optimizing it for on-device inference on a Mac, and benchmarking its performance. For this example, we focus on a Mac with M1 Max and specifically target the GPU, as transformer-based models like Llama-3.1-8B-Instruct are typically constrained by memory bandwidth. The GPU provides the best balance of compute FLOPS and memory bandwidth for our target device.
Baseline Model Export and Performance
We begin by exporting a version of the Llama model with the most basic options (e.g., no KV cache, static input shapes). This allows us to understand the export process, observe how the model generates tokens, and measure its performance using key metrics. This baseline model will also help us identify areas for improvement, which will be addressed in the following sections as we optimize the model’s performance.
Exporting the PyTorch Model into Core ML
To make the model exportable, we define a thin wrapper around the LlamaForCausalLM class. This wrapped model uses fixed input shapes and omits key-value caching (a topic we’ll address in later sections). While this version isn’t optimized for export, it serves as a solid starting point and requires only a small modification to the LlamaForCausalLM module, as shown below:
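A minimal sketch of such a wrapper is shown below; the class name and exact arguments are illustrative, and the essential points are the fixed-shape inputs and use_cache=False.

import torch
from transformers import LlamaForCausalLM

class BaselineLlamaForCausalLM(LlamaForCausalLM):
    """Thin wrapper: statically shaped inputs, no key-value caching."""

    @torch.no_grad()
    def forward(self, input_ids, attention_mask):
        # Return only the logits tensor so the model can be traced cleanly.
        return super().forward(
            input_ids=input_ids,
            attention_mask=attention_mask,
            use_cache=False,
        ).logits

torch_model = BaselineLlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.float32
).eval()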
To export the model, we will first trace the PyTorch model and then use Core ML Tools. Both steps require the input tensor shapes to be specified.
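A sketch of these two steps might look as follows, assuming a batch size of 1 and a context size of 2048 and reusing the wrapper defined above.

import coremltools as ct
import numpy as np
import torch

batch_size, context_size = 1, 2048

# Example inputs only fix the shapes used for tracing; the values do not matter.
example_inputs = (
    torch.zeros((batch_size, context_size), dtype=torch.int64),
    torch.ones((batch_size, context_size), dtype=torch.int64),
)
traced_model = torch.jit.trace(torch_model, example_inputs)

mlmodel = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="inputIds", shape=(batch_size, context_size), dtype=np.int32),
        ct.TensorType(name="attentionMask", shape=(batch_size, context_size), dtype=np.int32),
    ],
    outputs=[ct.TensorType(name="logits")],
)
mlmodel.save("Llama-3.1-8B-Instruct-baseline.mlpackage")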
Core ML Model Precision and Verification
By default, Core ML generates a model with Float16 precision. For this 8B model, the resulting Core ML model is around 16 GB in size (#ModelParameters x 2 bytes, since Float16 uses BitWidth / 8 = 2 bytes per parameter). We verify that the outputs of the Core ML model match those of the PyTorch model (which runs in Float32 precision) within a small tolerance.
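A minimal sketch of this check, reusing the example inputs and the models from the snippets above; the tolerance values are illustrative.

import numpy as np
import torch

with torch.no_grad():
    torch_logits = torch_model(*example_inputs).numpy()

coreml_logits = mlmodel.predict(
    {
        "inputIds": example_inputs[0].numpy().astype(np.int32),
        "attentionMask": example_inputs[1].numpy().astype(np.int32),
    }
)["logits"]

# The Core ML model computes in Float16, so only a loose tolerance is expected.
np.testing.assert_allclose(np.array(coreml_logits), torch_logits, atol=5e-2, rtol=5e-2)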
Model Inputs and Outputs
The context_size refers to the maximum number of tokens the model can process, which we set to 2048 (we will later vary this to assess its impact on performance).
The model has two inputs, both of which have static shapes, meaning the shape remains constant regardless of the input text length:
inputIds <shape=(batch_size, context_size)>: This represents the tokenized text input sequence. Each element is an integer corresponding to a token ID from the model's vocabulary. These tokens are generated by tokenizing the input text using the model's associated tokenizer. The tokens beyond the input text are padded with zeros up to the context size.
attentionMask <shape=(batch_size, context_size)>: This is a binary tensor where 1 indicates the presence of a text token and 0 denotes padding. The model uses this to create a causal mask of shape (1, 1, context_size, context_size) and adds it to the self-attention matrix. This ensures that values in the padded region are ignored and that each token only attends to the previous ones, maintaining the autoregressive nature of the language model.
The model returns one output:
logits <shape=(batch_size, context_size, vocab_size)>: These are raw, unnormalized probability scores for each token in the sequence for every entry in the vocabulary. The vocab_size refers to the total number of unique tokens in the model's vocabulary, which is 128,256 for the Llama 3.1 family of models.
Execution
The model operates in two stages: “prompt” and “extend.” Let’s walk through this using an example where the prompt is “What is generative AI?” This prompt consists of 7 tokens.
Prompt: The first 7 elements of inputIds are set to the integer tokens generated by the tokenizer from the prompt. The remaining elements are set to zeros. Similarly, the first 7 elements of attentionMask are set to 1, with the rest set to 0. With these inputs, the model outputs logits with the shape (1, context_size, vocab_size). To determine the next token in the sequence, we use a simple greedy sampling strategy, which suffices for benchmarking: it selects the token with the highest probability (in this case, the token at position 7, i.e., argmax(logits[0, 6, :])).
Extend: The selected token is appended to the non-zero elements of inputIds, and 1 is appended to attentionMask. The model is then invoked again with these updated inputs. As we continue this process, inputIds and attentionMask grow by one non-zero value per prediction. The process continues until we either reach the token limit (as specified by the user or the max context size) or the model generates an end-of-sequence token. A sketch of this prompt/extend loop follows below.
With the exported Core ML model, we can run this loop end to end and obtain a generated response for the prompt.
Understanding Execution Performance
To assess the model's performance, we calculate two key metrics: prompt latency and extend throughput.
Prompt Latency: This metric evaluates the model's initial responsiveness. It is measured by the time it takes for the model to produce its first token after processing the prompt. We measure this in milliseconds, where lower values are better. This is also known as TTFT (time to first token), which we will use for the remainder of this article.
Extend Throughput: This metric measures the model's efficiency in generating tokens. It is calculated by dividing the total number of tokens generated by the total time taken for generation. Extend throughput reflects how quickly the model can produce a continuous stream of output tokens. We typically allow the model to generate approximately 100 tokens to calculate this metric. This is the main performance metric used to compare different versions of the models and is reported in tokens per second (tokens/s), where higher values are better.
All TTFT and extend throughput numbers presented in this article were measured using the Swift runner with the Core ML framework, utilizing the newly released MLTensor APIs on a Mac with M1 Max running macOS Sequoia 15.2 Beta.
Performance of the Baseline Model
For the baseline model, which uses statically shaped inputs (zero-padded to the maximum context size) and does not implement key-value caching, the following prompt latency and extend throughput numbers were obtained with a context size of 2048.
As shown, the extend throughput is very low, even falling below 1 token/s. This is expected given the model's current construction and execution. There are two main reasons contributing to this slow inference:
(a) The model performs attention computations (i.e., matrix multiplications) for the entire sequence length of context_size (2048 in this case), even though only a small number of tokens actually need to be processed (<=107, for 7 prompt tokens + 100 generated tokens). This is due to the use of statically shaped padded inputs.
(b) For each token produced, there is significant re-computation of values that were already calculated during the processing of previous tokens. The transformer architecture employs an attention mechanism using three tensors: “query,” “key,” and “value,” which grow with each token processed. As explained in the next section, the "key" and "value" tensors for previous tokens can be cached (referred to as KV cache) and only need to be computed for new tokens, updating the cache accordingly. However, this caching does not occur in the baseline model.
In the following sections, we address both of these issues by implementing flexible-shaped inputs and a stateful key-value cache mechanism, which drastically improves performance.
For the baseline model, reducing the context_size during export helps mitigate the effects of (a) and (b), leading to increased throughput, as seen in the table below. However, this comes at the cost of limiting the practical usability of the model, as it restricts the text length window.

Model Optimizations
Now that we've established the baseline, let's explore the key optimizations to improve its performance. In addition to addressing the two issues identified in the previous section, we'll also cover how to implement a more optimized version of attention computation (using the fused SDPA operation) and quantize the model's weights to significantly boost decoding speed. These optimizations will be discussed in the following three sections:
Fused Scaled Dot Product Attention (SDPA)
Key-value cache and flexible-shaped inputs
Block-wise int4 weight quantization
Fused Scaled Dot Product Attention (SDPA)
Transformer models use Scaled Dot-Product Attention (SDPA) within multi-head attention blocks. The SDPA operation computes the attention matrix using the key, value, and query tensors, along with a mask, for all tokens. It then updates the representation that is passed to the next block. This process is computationally intensive, requiring multiple matrix multiplications, softmax operations, and element-wise additions and multiplications on high-dimensional tensors.
Starting with macOS Sequoia, Core ML has introduced scaled_dot_product_attention as a high-level operation (see Figure 1). This operation is mapped to a single fused GPU kernel, which executes more efficiently. For example, with a fused kernel, the "attention" tensor (the result of multiplying the "key" and "query" tensors), which can be very large (shape (1, #attn_heads, #token_length, #token_length)), no longer needs to be fully materialized, improving efficiency.

Figure 1: Before macOS 15 Sequoia, the SDPA operation is decomposed into several operations. Converting with Core ML Tools and targeting macOS 15 Sequoia produces a fused SDPA representation that is accelerated on the GPU.
Optimizing Model Performance
While the Core ML-GPU compiler attempts to automatically detect the pattern and fuse the SDPA operation, using the PyTorch operation torch.nn.functional.scaled_dot_product_attention (already utilized by Hugging Face's Llama implementation), combined with setting the minimum deployment target to macOS 15+ in the Core ML Tools conversion API, ensures the resulting model includes the fused SDPA operation (see Figure 1).
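For example, re-running the conversion from the earlier snippet with the macOS 15 deployment target is enough to obtain the fused operation; a sketch:

import coremltools as ct
import numpy as np

mlmodel = ct.convert(
    traced_model,  # traced as in the earlier export snippet
    inputs=[
        ct.TensorType(name="inputIds", shape=(1, 2048), dtype=np.int32),
        ct.TensorType(name="attentionMask", shape=(1, 2048), dtype=np.int32),
    ],
    outputs=[ct.TensorType(name="logits")],
    # Targeting macOS 15+ lets the converter emit the fused SDPA operation.
    minimum_deployment_target=ct.target.macOS15,
)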
Key-Value Cache and Flexible Shaped Inputs
The Transformer architecture consists of multiple attention blocks, each generating tensors known as "query," "key," and "value." These tensors are produced for every token the model processes. When a new token arrives, its query projection needs to be processed through the SDPA operation in combination with the key and value projections for all previous tokens. In the baseline model, the key-value projections are recomputed for all the prior tokens with each new token. To optimize this, we introduce a cache for both the keys and values, initialized with zeros.
For the Llama-3.1-8B-Instruct model, with a context size of 2048, the key and value caches each have the shape (32, 1, 8, 2048, 128): 32 attention blocks, 8 key-value heads per block, and 128-dimensional projections. After processing t tokens, the cache is updated such that Key[:, :, :, 0:t, :] holds computed values, and the remaining positions are zeros. When the (t+1)-th token is processed, its key/value tensors are computed and appended to the cache, expanding the non-zero region to Key[:, :, :, 0:t+1, :]. This process repeats as each new token is processed.
With the introduction of the key-value cache, the model can now accept flexible-shaped inputs, as follows:
inputIds <shape=(batch_size, [1, context_size])>: During the prompt stage, the inputIds shape is (1, 7), updating the cache for 7 tokens. In the extend stage, where tokens are processed one by one, the inputIds shape becomes (1, 1), updating the cache with each new token.
causalMask <shape=(batch_size, 1, [1, context_size], [1, context_size])>: Unlike the baseline model, we now feed the causal mask to the model directly instead of passing a binary attention mask and computing the causal mask internally. During the prompt stage, the causal mask shape is (1, 1, 7, 7), with values set to -inf in the upper triangular region and 0 elsewhere, encoding causality (preventing tokens from attending to future tokens). In the extend stage, the causal mask grows with each new token: for the first generated token its shape is (1, 1, 1, 8), for the second (1, 1, 1, 9), and so on. Since there are no future tokens to mask during decoding, all of its values are 0. A sketch of how these masks can be constructed appears after this list.
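A minimal sketch of how these causal masks can be constructed with NumPy; the Swift runner would build equivalent tensors with MLTensor.

import numpy as np

def prompt_causal_mask(prompt_length: int) -> np.ndarray:
    # Shape (1, 1, prompt_length, prompt_length): -inf strictly above the
    # diagonal, 0 on and below it.
    mask = np.full((1, 1, prompt_length, prompt_length), -np.inf, dtype=np.float16)
    return np.triu(mask, k=1)

def extend_causal_mask(total_tokens: int) -> np.ndarray:
    # total_tokens counts all tokens processed so far, including the new one.
    # Shape (1, 1, 1, total_tokens): no future tokens to mask, so all zeros.
    return np.zeros((1, 1, 1, total_tokens), dtype=np.float16)

print(prompt_causal_mask(7).shape)  # (1, 1, 7, 7)
print(extend_causal_mask(8).shape)  # (1, 1, 1, 8)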
A quick calculation reveals that, compared to the baseline model, the matrix multiplications in the SDPA operation become significantly smaller with these changes in input shape:
Baseline model: The query, key, and value tensors are consistently shaped (1, 32, 2048, 128) during both the prompt and extend stages. The query x key matrix multiplication produces an attention tensor of shape (1, 32, 2048, 2048), resulting in a complexity of O(32*128*2048^2).
Key-value cache and flexible-shaped model: During the prompt stage, the query, key, and value tensors are shaped (1, 32, prompt_length, 128), reducing the matrix multiplication complexity to O(32*128*prompt_length^2). In the decoding stage, the query tensor is always (1, 32, 1, 128), and the key and value tensors have the shape (1, 32, current_token_id, 128), resulting in a complexity of O(32*128*current_token_id).
The reduction in the number of operations compared to the baseline model is substantial. Before examining the impact on performance metrics, we must determine how to implement the key-value cache. There are various approaches (static, dynamic, etc.), and we will consider a static cache along with two implementation mechanisms, as described in the following sections.
Key-Value Cache as Model I/O (Inputs and Outputs)
In this basic implementation, the model remains "pure," with the key-value cache handled through model inputs and outputs (see Figure 2). Specifically, the pre-allocated cache tensor is provided as an input to the model. During token processing, the model updates the cache tensor and returns it as the output. The driver code then takes this output and uses it as the input for the next iteration (see Figure 2).
Figure 2: Key-value cache implemented as model I/O (inputs and outputs). This is possible to do prior to macOS Sequoia as well. For details, please refer to this video on deploying models with Core ML.
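A minimal sketch of the driver loop for this variant, assuming the converted model exposes the caches through hypothetical inputs named keyCache/valueCache and outputs named updatedKeyCache/updatedValueCache.

import numpy as np

# Pre-allocated caches: (#layers, batch, #kv_heads, context_size, head_dim).
key_cache = np.zeros((32, 1, 8, 2048, 128), dtype=np.float16)
value_cache = np.zeros((32, 1, 8, 2048, 128), dtype=np.float16)

def predict_step(model, input_ids, causal_mask, key_cache, value_cache):
    out = model.predict({
        "inputIds": input_ids,
        "causalMask": causal_mask,
        "keyCache": key_cache,      # hypothetical input name
        "valueCache": value_cache,  # hypothetical input name
    })
    # The updated caches come back as outputs and must be fed in again on the
    # next call; these large copies are what makes this variant slow.
    return out["logits"], out["updatedKeyCache"], out["updatedValueCache"]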
With this approach, the performance results are as follows:
Context Size: 2048
Prompt: 7 tokens, latency (TTFT): 933.89 ms
Extend: 100 tokens, throughput: 1.25 tokens/s
The extend throughput is roughly an order of magnitude faster than the baseline model. However, it still remains quite slow. By examining the performance in relation to the context size, we can gain further insight into the underlying factors:

We observe that performance improves significantly as the context size decreases. In this model, multiple copies of the key/value tensor are made during updates within each attention block, and again when copying it from the output to the next input. Since the size of the key-value cache grows with context size (2 (Key/Value) * 2 (#BytesInFP16DataType) * 32 (#Layers) * 8 (#KeyValueHeads) * 128 (AttentionHeadDim) * ContextSize bytes), the larger the context size, the more time is spent on memory copies. With the Llama 3.1 8B model, using a context size of 8192 can result in up to ~1GB of data being copied, which incurs a significant overhead. By implementing a stateful key-value cache, we can eliminate these costs. Let's explore how this works next.
Key-Value Cache as State
Starting with macOS Sequoia, Core ML introduced a new input type called "states" (see Figure 3). A prediction can now be stateful, meaning the state tensors are updated at the end of the prediction call without being explicitly returned. Depending on the compute backend and how state tensors are used in the model graph, the compiler can sometimes perform the update “in place.” This reduces the computational overhead of transferring states in and out of the model. This is how we implement the key-value cache for the Llama model using Core ML states.
Context Size: 2048
Prompt: 7 tokens, latency (TTFT): 128.32 ms
Extend: 99 tokens, throughput: 16.26 tokens/s
We now see a ~13x improvement in performance compared to using key-value cache as I/O for a 2048 context size. Additionally, the performance is much more consistent across different context sizes, although some computations still scale with context size, causing a slight monotonic trend. It's important to note that beyond a context size of 2048, the key-value cache grows too large to consistently fit within the GPU cache, leading to more frequent cache misses and a subsequent decrease in decoding speed. This effect doesn’t occur up to a context size of 1024, as the key-value cache remains within the GPU cache limits.

Figure 3: Key-value cache as model state. This is possible with macOS Sequoia and is more efficient than implementing the cache with model inputs and outputs (Figure 2). For more details, please refer to this video on deploying models with Core ML.
Model Export
Now, let's demonstrate how to implement the stateful key-value cache and flexible input features.
We create our own static cache implementation, which is passed to the transformers API through the SliceUpdateKeyValueCache class, extending the Cache class. This class implements a simple update mechanism using slicing operations. These operations are detected by the Core ML-GPU compiler, enabling in-place updates.
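A minimal sketch of such a cache class; it is simplified, and the exact implementation may differ in details such as how past_seen_tokens is tracked.

import torch
from transformers.cache_utils import Cache

class SliceUpdateKeyValueCache(Cache):
    """Pre-allocated key-value cache updated with slicing operations."""

    def __init__(self, *, shape, dtype=torch.float32):
        super().__init__()
        self.past_seen_tokens = 0
        # shape = (#layers, batch, #kv_heads, context_size, head_dim)
        self.k = torch.zeros(shape, dtype=dtype)
        self.v = torch.zeros(shape, dtype=dtype)

    def update(self, k_state, v_state, layer_idx, cache_kwargs=None):
        position = cache_kwargs.get("cache_position", None)
        begin = self.past_seen_tokens
        end = self.past_seen_tokens + position.shape[-1]
        # Slice-update ops like these are recognized by the Core ML-GPU
        # compiler and can be executed in place on the state tensors.
        self.k[layer_idx, :, : k_state.shape[1], begin:end, :] = k_state
        self.v[layer_idx, :, : v_state.shape[1], begin:end, :] = v_state
        return self.k[layer_idx, :, :, :end, :], self.v[layer_idx, :, :, :end, :]

    def get_seq_length(self, layer_idx=0):
        return self.past_seen_tokens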
Next, we define a wrapper, KVCacheStateLlamaForCausalLM, on top of the LlamaForCausalLM class to incorporate this custom cache class. To ensure the Core ML conversion process detects and generates a model with the key-value cache as state inputs, we register these states using PyTorch's register_buffer API.
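A sketch of this wrapper (simplified): the cache tensors are registered as named buffers, which is what allows the converter to expose them as states.

import torch
from transformers import LlamaForCausalLM

class KVCacheStateLlamaForCausalLM(torch.nn.Module):
    """LlamaForCausalLM with a pre-allocated, stateful key-value cache."""

    def __init__(self, model_path="meta-llama/Llama-3.1-8B-Instruct",
                 *, batch_size=1, context_size=2048):
        super().__init__()
        self.model = LlamaForCausalLM.from_pretrained(model_path)
        config = self.model.config
        cache_shape = (
            config.num_hidden_layers,            # 32 layers
            batch_size,
            config.num_key_value_heads,          # 8 key-value heads
            context_size,
            config.hidden_size // config.num_attention_heads,  # head dim 128
        )
        self.kv_cache = SliceUpdateKeyValueCache(shape=cache_shape)
        # The buffer names become the Core ML state names during conversion.
        self.register_buffer("keyCache", self.kv_cache.k)
        self.register_buffer("valueCache", self.kv_cache.v)

    @torch.no_grad()
    def forward(self, input_ids, causal_mask):
        # Tokens already in the cache = total mask length minus the new tokens.
        self.kv_cache.past_seen_tokens = causal_mask.shape[-1] - input_ids.shape[-1]
        return self.model(
            input_ids=input_ids,
            attention_mask=causal_mask,
            past_key_values=self.kv_cache,
            use_cache=True,
        ).logits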
In the export code, we use the coremltools.RangeDim class to define the model inputs with a flexible shape, and the coremltools.StateType class to ensure that kv_cache.k and kv_cache.v are recognized as state inputs. The remaining setup remains unchanged.
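A sketch of the corresponding conversion call; the shape bounds and example inputs are illustrative.

import coremltools as ct
import numpy as np
import torch

context_size = 2048
torch_model = KVCacheStateLlamaForCausalLM(context_size=context_size).eval()

# Trace with small example shapes; the values are irrelevant for tracing.
example_inputs = (
    torch.zeros((1, 2), dtype=torch.int64),
    torch.zeros((1, 1, 2, 5), dtype=torch.float32),
)
traced_model = torch.jit.trace(torch_model, example_inputs)

query_length = ct.RangeDim(lower_bound=1, upper_bound=context_size, default=1)
end_step = ct.RangeDim(lower_bound=1, upper_bound=context_size, default=1)
kv_cache_shape = (32, 1, 8, context_size, 128)

mlmodel = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="inputIds", shape=(1, query_length), dtype=np.int32),
        ct.TensorType(name="causalMask", shape=(1, 1, query_length, end_step), dtype=np.float16),
    ],
    outputs=[ct.TensorType(name="logits")],
    # State names must match the buffers registered in the wrapper above.
    states=[
        ct.StateType(wrapped_type=ct.TensorType(shape=kv_cache_shape, dtype=np.float16), name="keyCache"),
        ct.StateType(wrapped_type=ct.TensorType(shape=kv_cache_shape, dtype=np.float16), name="valueCache"),
    ],
    minimum_deployment_target=ct.target.macOS15,  # states require macOS 15+
)
mlmodel.save("Llama-3.1-8B-Instruct-kvcache.mlpackage")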
To run the model, there's no need to manage the cache as part of the model I/O. The only necessary change is to pass the state when calling the predict function.
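A minimal sketch of a stateful prediction for the prompt stage; the extend stage follows the same pattern with a (1, 1) inputIds and a growing causal mask.

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
prompt_ids = np.array([tokenizer("What is generative AI?").input_ids], dtype=np.int32)
n = prompt_ids.shape[1]

# Causal mask for the prompt stage: -inf above the diagonal, 0 elsewhere.
causal_mask = np.triu(np.full((1, 1, n, n), -np.inf, dtype=np.float16), k=1)

kv_state = mlmodel.make_state()  # created once, then reused across calls
logits = mlmodel.predict(
    {"inputIds": prompt_ids, "causalMask": causal_mask},
    kv_state,  # the key-value cache is updated in place inside this state
)["logits"]
next_token = int(np.argmax(logits[0, -1, :]))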
Block-Wise Int4 Quantization
macOS Sequoia introduced several low-bit quantization techniques supported by Core ML, including 4-bit block-wise linear quantization and channel group-wise palettization, to improve model compression and accuracy (see Figure 4). These methods are crucial for optimizing memory usage and performance during on-device inference. For instance, low-bit palettization significantly reduces the model's memory footprint and boosts latency on the neural engine, while block-wise quantization minimizes accuracy loss by applying quantization at a higher granularity, optimized for the GPU. Additional information can be found here.

Figure 4: Weight compression features introduced in macOS Sequoia. For details, check out this WWDC 2024 session video.
To further enhance model performance, we will quantize the model to the Int4 format using block-wise quantization with a block size of 32. We will apply a straightforward data-free Post-Training Quantization (PTQ) method. As our primary focus is evaluating latency and throughput, we will not assess model quality on benchmark datasets commonly used for accuracy evaluation. However, we did observe that the quantized model produces outputs very similar to the Float16 precision model on a few test prompts. Depending on the application and its testing needs, some accuracy loss might occur with PTQ quantization, and calibration or fine-tuning may be necessary. It's important to note that performing this on the PyTorch model first (e.g., with the coremltools.optimize.torch APIs) before converting the model will not impact performance.
The following code snippet demonstrates how to quantize the Float16 model to Int4 format:
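A sketch using the Core ML Tools optimize APIs (per-block linear quantization, block size 32), applied to the stateful Float16 model converted earlier.

import coremltools.optimize as cto

op_config = cto.coreml.OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int4",
    granularity="per_block",
    block_size=32,
)
config = cto.coreml.OptimizationConfig(global_config=op_config)

# mlmodel is the stateful Float16 Core ML model from the previous section.
mlmodel_int4 = cto.coreml.linear_quantize_weights(mlmodel, config=config)
mlmodel_int4.save("Llama-3.1-8B-Instruct-kvcache-int4.mlpackage")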
Int4 Core ML Llama-3.1-8B-Instruct Model
Prompt: "What is generative AI?"
Response: "Generative AI refers to a type of artificial intelligence (AI) that can create new content, such as images, videos, music, or text, based on a given prompt or input. This technology uses machine learning algorithms to generate novel and often surprising outputs that can be used for a variety of applications, including art, design, entertainment, and even education..."
With this change, the extend throughput for the default context size of 2048 improves to approximately 33 tokens per second, about twice the speed of the Float16 model. Additionally, the model size drops from 16 GB to 4.2 GB, a roughly 4x reduction.
The following table shows the impact of context size on the extend throughput.

Model Export via torch.export and Inference with ExecuTorch
In this article, we utilized the torch.jit.trace method to capture the PyTorch graph, which was then converted to Core ML format using Core ML Tools. The newer torch.export path, currently in beta, may require slight adjustments to the Llama model definition for export. Core ML Tools also supports this torch.export path (in beta mode), either directly or via the ExecuTorch backend.
Llama can be executed on ExecuTorch through the Core ML backend by using torch.export with a custom export path. ExecuTorch, part of the PyTorch ecosystem, is designed for deploying machine learning models on mobile and edge devices, providing a full PyTorch experience. It includes a Core ML backend that uses Core ML Tools for model export and the Core ML framework to run machine learning models efficiently within the ExecuTorch runtime on Apple devices. Furthermore, ExecuTorch supports a custom export path for the Llama family of models. Learn more about getting started with the ExecuTorch Core ML backend for Llama model export and deployment on Mac.
Conclusion
By leveraging the Core ML framework and the optimizations discussed in this post, app developers can deploy large language models (LLMs) to run locally on Apple silicon, taking full advantage of the user’s hardware for efficient and cost-effective on-device inference while enhancing user privacy.
This post outlined the process of optimizing the Llama-3.1-8B-Instruct model and deploying it on a Mac with an M1 Max chip running macOS Sequoia, achieving a decoding rate of around 33 tokens per second. Two key optimizations were applied to address bottlenecks in large attention matrix computations and model weight memory: quantization to Int4 to reduce the model size and a stateful key-value cache to minimize data copying and reuse compute resources in each decoding iteration.
The principles shared here are applicable to other transformer-based LLMs, and as LLMs with smaller parameter counts continue to improve, on-device deployment on Apple silicon via Core ML is expected to become even faster and more efficient.
Press contact
Timon Harz
oneboardhq@outlook.com