LLaMA2 PyTorch from Scratch (WiP)
Article Summary (TL;DR)
This article is a summary of the paper "LLaMA2 from Scratch: An Own Implementation" and provides guidance for the accompanying repository here. The paper was written by Thomas Ziereis and Christian Bernhard and supervised by Andrew Luckow as part of the lecture Advanced Analytics and Machine Learning (AAML).
This article aims to help you understand the inner workings of the LLaMA2 model, from architecture to inference, focusing on optimizing the model for consumer hardware through quantization and C++/CUDA enhancements. You'll find a concise overview of the project's goals, setup, model implementation, optimization techniques, and practical tips for efficient inference. For comprehensive details, please refer to the paper and the repository.
Overview/Usage
Setup Guide
- Install Dependencies
Run the following command to install all required Python packages:
```bash
pip install -r requirements.txt
```
- Download Model Weights
Download the Meta LLaMA2 7B weights from here. We highly recommend downloading the llama2-7b-chat model for superior text generation performance.
- Organize Model Files
Create a `/bin` directory in the project root and move the `llama-2-7b-chat` directory and the `tokenizer.model` file into the `/bin` directory.
- Export Model Weights
Export the model weights into our binary file format with `export.py`. Run the following to see the available export options:
```bash
python3 export.py --help
```
Note: This process requires at least ~28GB of peak RAM.
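For reference, the ~28GB figure is consistent with holding all of the 7B float32 parameters in memory at once: roughly 7 × 10⁹ parameters × 4 bytes ≈ 28 GB.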
- Compile the C/CUDA Library
Compile the library used for quantization and model inference with the following commands:
```bash
mkdir build && cd build
cmake .. -DCUDA=ON   # with CUDA support
cmake ..             # without CUDA
make
```
- Run Text Completion
You can now run text completion using the base model:
```bash
python3 run.py --bin=bin/chat-llama.bin "Richard Feynman was a "
```
Note: This can be slow. For better performance, quantize the model weights as described below.
Quantization for Improved Performance
Quantization reduces the precision of the model's weights, significantly enhancing performance and allowing the model to run on consumer hardware. The project supports 8-bit and 4-bit quantization. For GPUs, a 4-bit forward pass is implemented to accommodate hardware constraints, enabling the model to run on almost any GPU with around 6GB of VRAM.
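To put these numbers into perspective, here is a rough back-of-the-envelope estimate of the weight memory of a ~7B-parameter model at different precisions. This is a sketch only: it ignores activations, the KV cache, and per-group quantization metadata such as scales, so real usage is somewhat higher.

```python
# Approximate weight memory of a ~7B-parameter model at different precisions.
N_PARAMS = 7e9

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    gigabytes = N_PARAMS * bits / 8 / 1e9
    print(f"{name:>8}: ~{gigabytes:.1f} GB")

# float32: ~28.0 GB, float16: ~14.0 GB, int8: ~7.0 GB, int4: ~3.5 GB
```

The roughly 3.5 GB of 4-bit weights plus working buffers is what makes inference on a GPU with about 6GB of VRAM plausible.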
LLaMA Architecture Overview
The LLaMA2 model architecture can be easily understood using the PyTorch implementation. Below is a simplified version of the PyTorch model code showcasing the core components of the model. This version is designed to be easy to read and understand, focusing on the structure and flow of the model.
```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    def __init__(self, args):
        super().__init__()
        self.attention_norm = RMSNorm(args.dim, args.norm_eps)
        self.attention = SelfAttention(args)
        self.ffn_norm = RMSNorm(args.dim, args.norm_eps)
        self.feed_forward = FeedForward(args)

    def forward(self, x, pos):
        # Pre-norm attention with a residual connection
        normalized = self.attention_norm(x)
        att = self.attention(normalized, pos)
        encoded = x + att
        # Pre-norm feed-forward with a residual connection
        ffn_normalized = self.ffn_norm(encoded)
        ffn_out = self.feed_forward(ffn_normalized)
        out = encoded + ffn_out
        return out


class Transformer(nn.Module):
    def __init__(self, args):
        super().__init__()
        self.params = args
        self.dim = args.dim
        self.vocab_size = args.vocab_size
        self.tok_embeddings = nn.Embedding(args.vocab_size, args.dim)
        self.layers = nn.ModuleList([EncoderBlock(args) for _ in range(args.n_layers)])
        self.norm = RMSNorm(args.dim, args.norm_eps)
        self.output = nn.Linear(args.dim, args.vocab_size, bias=False)

    def forward(self, token, pos):
        # Process a single token at position `pos`
        embedding = self.tok_embeddings(torch.tensor([token], dtype=torch.long))[0, :]
        for layer in self.layers:
            embedding = layer(embedding, pos)
        embedding_normalized = self.norm(embedding)
        output = self.output(embedding_normalized).float()
        return output
```
LLaMA2 Architecture
The PyTorch implementation above provides a clear view of the LLaMA2 architecture. Each `EncoderBlock` contains an attention mechanism and a feed-forward network, both of which are normalized using RMSNorm. These blocks are stacked together in the `Transformer` class, which processes the input tokens through embedding, sequentially through each layer, and finally through an output linear layer to generate predictions.
This architecture is implemented in C++ in our project for performance optimization, but the PyTorch version is an excellent reference for understanding the model's structure. The corresponding C++ implementation replicates this architecture, ensuring the same logical flow and functionality but with enhanced performance suitable for consumer hardware.
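The snippet above references `RMSNorm`, `SelfAttention`, and `FeedForward` without defining them. For orientation, here is a minimal sketch of `RMSNorm` and a SwiGLU-style `FeedForward` following the standard LLaMA formulation; the constructor signatures are simplified, and the repository's actual implementation (hidden-dimension rounding, exact epsilon, and so on) may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square normalization with a learned per-dimension gain."""

    def __init__(self, dim: int, eps: float):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS of the activations, then rescale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class FeedForward(nn.Module):
    """SwiGLU feed-forward block: w2(silu(w1(x)) * w3(x))."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

`SelfAttention` adds rotary position embeddings and a KV cache on top of standard scaled dot-product attention and is best read directly in the repository.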
Quantization and Efficient Model Execution
The heart of the LLaMA2 implementation lies in its C++ code, but to optimize for consumer hardware, we use quantization techniques. Here's an overview of how we achieved efficient model execution through quantization, based on the information from the paper and the repository.
Why Quantization?
Quantization is essential for running large models like LLaMA2 on consumer hardware because it reduces the precision of the model weights, significantly enhancing performance. Quantization techniques such as `float16`, `int8`, and `int4` are used to convert the weights to lower precision formats. This helps in:
- Reducing memory footprint
- Accelerating computation
- Allowing models to run on hardware with limited resources
Custom Quantization Implementation
To have full control over the quantization process, we implemented our own quantization code. This allows for selective precision adjustments where some operations still work with high precision values to maintain accuracy. For example, small parameter matrices might not benefit from quantization, while large matrices do, ensuring a balance between performance and accuracy.
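The project's quantization is implemented in the C/C++ (and CUDA) library, but the core idea can be sketched in a few lines of Python. The snippet below shows plain symmetric, group-wise int8 quantization together with the matching dequantization step; it illustrates the general technique, not the exact scheme or group size used in the repository.

```python
import torch


def quantize_q8(weights: torch.Tensor, group_size: int = 64):
    """Symmetric int8 quantization: each group of values shares one fp32 scale."""
    groups = weights.float().reshape(-1, group_size)
    scales = groups.abs().max(dim=1, keepdim=True).values / 127.0  # one scale per group
    q = torch.clamp(torch.round(groups / scales), -127, 127).to(torch.int8)
    return q, scales


def dequantize_q8(q: torch.Tensor, scales: torch.Tensor, shape):
    """Recover an approximate fp32 tensor for operations that need higher precision."""
    return (q.float() * scales).reshape(shape)


w = torch.randn(4096, 4096)
q, scales = quantize_q8(w)
w_hat = dequantize_q8(q, scales, w.shape)
print("mean absolute error:", (w - w_hat).abs().mean().item())
```

The group size controls the accuracy/overhead trade-off: smaller groups track outliers better but store more scales.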
Data Format and Quantization Process
We saved the model weights in our own `.bin` format to better handle them, since the `.pth` structure of PyTorch is not always controllable. This custom format allows for better management of weight loading and quantization. Quantization processes involve converting weights to lower precision and sometimes dequantizing back to higher precision for specific operations to maintain model accuracy.
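As a purely illustrative example of why a flat binary layout is convenient, the toy sketch below writes tensors as contiguous float32 streams with a minimal length prefix, which is trivial to `fread` or memory-map from C/C++. The actual header layout, weight ordering, and quantized formats are defined by `export.py` and differ from this toy version.

```python
import struct

import torch


def export_flat(tensors: dict, path: str) -> None:
    """Toy exporter: write each tensor as a little-endian element count
    followed by its raw float32 values. Illustrative only."""
    with open(path, "wb") as f:
        for tensor in tensors.values():
            values = tensor.detach().to(torch.float32).reshape(-1).numpy()
            f.write(struct.pack("<q", values.size))  # number of elements
            f.write(values.tobytes())
```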
By selectively quantizing and dequantizing weights, we ensure that critical computations maintain high precision while still benefiting from the reduced memory and computational requirements of quantization. This approach strikes a balance between performance and accuracy, leveraging the strengths of both high and low precision arithmetic.
For detailed explanations and the reasoning behind our approach, please refer to the paper.
Inference and Text Generation
To utilize the LLaMA2 model, you can run inference and generate text using the provided scripts. Here’s a guide on how to use the model via the command line and an explanation of the different text generation strategies.
Command Line Usage
You can generate text using the `run.py` script. Here's an example command:
```bash
python3 run.py --bin=bin/llama_q8.bin --max-toks=200 --method=top_p --temperature=0.3 --top_p=0.9 --top_k=5 "Richard Feynman was a "
```
This command runs the `run.py` script with the specified parameters, generating text based on the prompt "Richard Feynman was a ".
Text Generation Process
Text generation is handled by the `generate.py` script, which includes the `generate_text` function. Below is a simplified version of `generate_text` to illustrate the process:
```python
import torch
import os
from qllama import Runtime
from llama_utils import load_tokenizer
import time

tokenizer = load_tokenizer("bin/tokenizer.model")


def generate_text(llama: Runtime, prompt: str, max_toks: int = 30, method: str = 'greedy',
                  temperature: float = 0.1, top_p: float = 0.1, top_k: int = 10) -> str:
    input_tokens = tokenizer.encode(prompt)
    output_tokens = []
    start_time = time.time()

    # Fill KV Cache with input tokens
    first_token = input_tokens[0]
    _ = llama.forward(first_token, len(output_tokens))
    output_tokens.append(first_token)
    for token in input_tokens[1:]:
        _ = llama.forward(token, len(output_tokens))
        output_tokens.append(token)

    # Generate completion tokens
    while len(output_tokens) < max_toks:
        latest_token = output_tokens[-1]
        out = llama.forward(latest_token, len(output_tokens))
        out = torch.tensor(out)
        if method == 'greedy':
            next_token = torch.argmax(out).item()
        elif method == 'top_p':
            next_token = sample_top_p(torch.softmax(out / temperature, dim=-1), top_p)
        elif method == 'top_k':
            next_token = sample_top_k(torch.softmax(out / temperature, dim=-1), top_k)
        if next_token == tokenizer.eos_id():
            break
        output_tokens.append(next_token)

    total_time = time.time() - start_time
    tokens_per_second = len(output_tokens) / total_time
    print(f"Tokens per second: {tokens_per_second:.2f}")
    return tokenizer.decode(output_tokens)


def sample_top_p(probs: torch.Tensor, p: float) -> int:
    # ...
```
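The body of `sample_top_p` is omitted in the excerpt above. For reference, a standard nucleus-sampling implementation looks roughly like the following; this is a sketch of the common technique and not necessarily identical to the repository's version.

```python
import torch


def sample_top_p(probs: torch.Tensor, p: float) -> int:
    """Nucleus sampling: sample from the smallest set of tokens whose
    cumulative probability mass is at least p."""
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Zero out tokens once the cumulative mass (excluding the current token)
    # already exceeds p, then renormalize and sample.
    mask = cumulative - sorted_probs > p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()
```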
Inference Strategies
- Greedy: Always selects the token with the highest probability. This method can result in repetitive and less creative outputs.
- Top-P (Nucleus Sampling): Selects tokens from the smallest set whose cumulative probability is at least `p`. This method allows for more diverse and creative outputs compared to greedy sampling.
- Top-K: Selects the next token from the top `k` tokens with the highest probabilities. This method also promotes diversity and creativity in the generated text (a `sample_top_k` sketch follows below).
Temperature: Adjusts the randomness of the token selection. Lower temperatures make the model more deterministic (less random), while higher temperatures increase randomness.
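`generate_text` also calls a `sample_top_k` helper, which is not shown above. A straightforward version of top-k sampling, again as an illustrative sketch rather than the repository's exact code, could look like this:

```python
import torch


def sample_top_k(probs: torch.Tensor, k: int) -> int:
    """Top-k sampling: keep only the k most likely tokens, renormalize, and sample."""
    top_probs, top_idx = torch.topk(probs, k)
    top_probs /= top_probs.sum()
    choice = torch.multinomial(top_probs, num_samples=1)
    return top_idx[choice].item()
```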
Running the Inference
To generate text using the model, run the `run.py` script with the appropriate arguments. For example:
```bash
python3 run.py --bin=bin/llama_q8.bin --max-toks=200 --method=top_p --temperature=0.3 --top_p=0.9 --top_k=5 "Richard Feynman was a "
```
This command will generate text based on the provided prompt using the specified parameters for sampling method and temperature.
For detailed benchmark values and performance metrics, refer to the paper.
CUDA Implementation
To further accelerate the LLaMA2 model, we implemented the core functionalities using CUDA to leverage GPU compute power. This allows for significantly faster inference times compared to CPU-only implementations.
Benefits of CUDA
- Parallel Processing: GPUs can handle thousands of threads simultaneously, making them ideal for the parallel nature of neural network computations.
- Speed: CUDA implementations can drastically reduce the time required for inference by offloading heavy computations to the GPU.
- Efficiency: Utilizing GPU resources can improve the overall efficiency of the model, making it feasible to run on consumer-grade hardware.
How to Use the CUDA Implementation
To use the CUDA-accelerated version of the model, follow the setup guide provided in the README:
- Install Dependencies
Ensure you have CUDA installed on your system. You can verify this by running:
```bash
nvcc --version
```
Then, install the required Python packages:
```bash
pip install -r requirements.txt
```
- Compile the CUDA Library
Compile the library used for quantization and model inference with CUDA support:
```bash
mkdir build && cd build
cmake .. -DCUDA=ON
make
```
- Running Inference with GPU
When running the `run.py` script, add the `--gpu` flag to enable GPU support:
```bash
python3 run.py --bin=bin/llama_q8.bin --max-toks=200 --method=top_p --temperature=0.3 --top_p=0.9 --top_k=5 --gpu "Richard Feynman was a "
```
This command will utilize the GPU for inference, leveraging the CUDA implementation to accelerate the process.
For more detailed instructions and performance benchmarks, refer to the paper.
By following these steps, you can take full advantage of GPU acceleration to achieve faster and more efficient text generation with the LLaMA2 model.